This repository provides a high-performance C++/CUDA application for performing multiple vision tasks, such as object detection, semantic segmentation, depth estimation, and optical flow, using Meta's DINOv3 as the backbone with the NVIDIA DeepStream SDK. The key advantage of this approach is that the DINOv3 backbone features, whose computation is the most demanding step, are computed only once and then reused by lightweight task-specific heads. This design significantly reduces redundant computation and makes multi-task inference more efficient. In this version, inference is performed through NVIDIA DeepStream's TensorRT integration for maximum throughput.
This project complements dinov3_ros, providing a DeepStream-based alternative for production environments requiring maximum performance, hardware acceleration, and integration with NVIDIA Jetson or GPU-accelerated video pipelines.
- Real-time multi-task inference: Run detection, segmentation, depth, and optical flow simultaneously
- Efficient backbone sharing: DINOv3 features computed once and shared across all tasks
- Hardware-accelerated pipeline: Full GStreamer/DeepStream pipeline with CUDA/TensorRT
- Flexible input sources: Camera (V4L2), video files, RTSP streams, or generic URIs
- Display modes: Separate windows or tiled view for all inference heads
- Low latency: Optimized for real-time video analytics applications
- Configurable: Enable/disable tasks, adjust visualization, debug pipeline
- Install the CUDA Toolkit: follow the NVIDIA CUDA installation guide.
- Install the DeepStream SDK: download and install it from NVIDIA DeepStream.
- Install the GStreamer development libraries: follow the GStreamer installation guide.
```bash
git clone https://github.com/Raessan/dinov3_deepstream.git
cd dinov3_deepstream/dinov3_deepstream
mkdir build && cd build
cmake ..
make -j$(nproc)
```

The compiled binary will be located at `build/dinov3_deepstream`.
You need to obtain model weights for both the DINOv3 backbone and the task-specific heads:
- DINOv3 backbone: request and download the weights from the official DINOv3 repo, then export the model to an ONNX/TensorRT format compatible with DeepStream (see the example after this list).
- Task-specific heads: this repo contains the ONNX models of each subtask. They can also be obtained from the following repositories (trained with the `vits16plus` backbone):
  - Detection: object_detection_dinov3
  - Segmentation: semantic_segmentation_dinov3
  - Depth: depth_dinov3
  - Optical flow: optical_flow_dinov3

Users are encouraged to improve the performance of any task by training and using their own ONNX models!
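As an illustration, once the backbone is exported to ONNX, a TensorRT engine can be pre-built with TensorRT's `trtexec` tool. The file names below are placeholders; alternatively, `nvinfer` builds the engine automatically on first run from the `onnx-file` entry in its config:

```bash
# Placeholder file names; adjust to your exported backbone
trtexec --onnx=dinov3_backbone.onnx \
        --saveEngine=dinov3_backbone.engine \
        --fp16
```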
Docker support with the NVIDIA Container Toolkit is available for simplified deployment.

- Install the NVIDIA Container Toolkit on the host machine.

Build and start the container:

```bash
docker compose build
docker compose up
```

Access the container:

```bash
docker exec -it dinov3_deepstream bash
```

Run the application from the build directory:
```bash
./dinov3_deepstream [OPTIONS]
```

Options:

- `--source-type TYPE`: Input source type: `camera`, `file`, `rtsp`, or `uri` (default: `camera`)
- `--source-uri URI`: Source URI (device path, file path, or stream URL)
  - Camera: `/dev/video0`
  - File: `/path/to/video.mp4` or `./video.mp4` (absolute or relative paths)
  - RTSP: `rtsp://192.168.1.100:8554/stream`
- `--framerate FPS`: Frame rate for processing (default: `30`)
- `--display-mode MODE`: Display mode: `separate` or `tiled` (default: `tiled`)
- `--do-depth [true|false]`: Enable/disable depth estimation (default: `true`)
- `--do-detection [true|false]`: Enable/disable object detection (default: `true`)
- `--do-segmentation [true|false]`: Enable/disable segmentation (default: `true`)
- `--do-optical-flow [true|false]`: Enable/disable optical flow (default: `true`)
- `--debug [true|false]`: Enable debug mode with pipeline visualization (default: `false`)
- `--dot-file PATH`: Path for the pipeline DOT file (default: `./pipeline`)
- `--config CONFIG`: Path to a DINOv3 config file (overrides the default)
- `-h, --help`: Show the help message
Run with a USB camera (all tasks enabled):

```bash
./dinov3_deepstream --source-type camera --source-uri /dev/video0
```

Process a video file with only depth and segmentation:

```bash
./dinov3_deepstream --source-type file --source-uri /path/to/video.mp4 \
    --do-detection false --do-optical-flow false
```

Stream from an RTSP source with tiled display:

```bash
./dinov3_deepstream --source-type rtsp \
    --source-uri rtsp://192.168.1.100:8554/stream \
    --display-mode tiled
```

Debug mode (generate a pipeline visualization):

```bash
./dinov3_deepstream --debug true --dot-file ./debug/pipeline
# Convert DOT file to image:
dot -Tpng ./debug/pipeline.dot -o ./debug/pipeline.png
```

Model inference settings are configured via text files in `dinov3_deepstream/configs/`:
- `config_infer_dinov3.txt`: DINOv3 backbone configuration
- `config_infer_depth.txt`: Depth head configuration
- `config_infer_detection.txt`: Detection head configuration
- `config_infer_segmentation.txt`: Segmentation head configuration
- `config_infer_optical_flow.txt`: Optical flow head configuration
These files specify model paths, input dimensions, layer names, and TensorRT engine parameters. Update them according to your model files and requirements.
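For illustration, a minimal `nvinfer` head configuration could look like the following. Every path and value here is a placeholder, so consult the actual files in `configs/` for the real settings:

```ini
[property]
gpu-id=0
# Placeholder model paths; nvinfer builds the engine from the ONNX if missing
onnx-file=../models/depth_head.onnx
model-engine-file=../models/depth_head.onnx_b1_gpu0_fp16.engine
batch-size=1
# 0=FP32, 1=INT8, 2=FP16
network-mode=2
# 100 is commonly used for "other" networks whose output is parsed downstream
network-type=100
# Attach raw output tensors so the pad probes can post-process them
output-tensor-meta=1
```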
Meta has only released model heads for the large ViT-7B backbone, so for smaller backbones we trained task-specific heads (each < 5M parameters) in separate repositories to achieve good precision. Our goal was not to beat SOTA models, but to provide a lightweight, plug-and-play toolkit.
Each task is implemented as a DeepStream probe that processes the inference output and performs visualization. The backbone produces shared features that are fed to all task-specific heads, minimizing redundant computation.
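Concretely, a probe receives each buffer after a head's `nvinfer` element and reads the raw output tensors that DeepStream attaches when `output-tensor-meta=1` is set. The sketch below is illustrative, not this repo's exact code; it only shows the metadata walk:

```cpp
#include <gst/gst.h>
#include "gstnvdsmeta.h"
#include "gstnvdsinfer.h"

// Illustrative pad probe: walk the batch metadata attached by nvinfer
// and locate each frame's raw output tensors for post-processing.
static GstPadProbeReturn
task_probe (GstPad *pad, GstPadProbeInfo *info, gpointer user_data)
{
  GstBuffer *buf = GST_PAD_PROBE_INFO_BUFFER (info);
  NvDsBatchMeta *batch_meta = gst_buffer_get_nvds_batch_meta (buf);
  if (!batch_meta)
    return GST_PAD_PROBE_OK;

  for (NvDsMetaList *lf = batch_meta->frame_meta_list; lf; lf = lf->next) {
    NvDsFrameMeta *frame = (NvDsFrameMeta *) lf->data;
    for (NvDsMetaList *lu = frame->frame_user_meta_list; lu; lu = lu->next) {
      NvDsUserMeta *um = (NvDsUserMeta *) lu->data;
      if (um->base_meta.meta_type != NVDSINFER_TENSOR_OUTPUT_META)
        continue;
      NvDsInferTensorMeta *tm = (NvDsInferTensorMeta *) um->user_meta_data;
      // tm->out_buf_ptrs_dev[i] is the device pointer to output layer i;
      // a real probe launches a CUDA visualization kernel on it here.
    }
  }
  return GST_PAD_PROBE_OK;
}
```

Such a probe is attached to the head's source pad with `gst_pad_add_probe()` using `GST_PAD_PROBE_TYPE_BUFFER`.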
Object detection using a lightweight FCOS-style detection head. Outputs bounding boxes with class labels and confidence scores.
Check the following repo: object_detection_dinov3
- Implementation: `src/probes/dinov3_probe.cpp`
- Parser: `src/custom_parsers/nvdsinfer_custom_detection.cpp`
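DeepStream loads custom detection parsers through the `NvDsInferParseCustomFunc` interface declared in `nvdsinfer_custom_impl.h`. The skeleton below is a hedged illustration of that interface, not the decoding logic of this repo's FCOS-style head:

```cpp
#include "nvdsinfer_custom_impl.h"

// Illustrative skeleton of a custom bounding-box parser; the real decoding
// lives in src/custom_parsers/nvdsinfer_custom_detection.cpp.
extern "C" bool NvDsInferParseCustomDetection (
    std::vector<NvDsInferLayerInfo> const &outputLayersInfo,
    NvDsInferNetworkInfo const &networkInfo,
    NvDsInferParseDetectionParams const &detectionParams,
    std::vector<NvDsInferObjectDetectionInfo> &objectList)
{
  // Decode boxes/scores from outputLayersInfo here, then append them:
  NvDsInferObjectDetectionInfo obj;
  obj.classId = 0;                    // placeholder class
  obj.detectionConfidence = 0.9f;     // placeholder score
  obj.left = 0;                       // box in network coordinates
  obj.top = 0;
  obj.width = networkInfo.width / 2;
  obj.height = networkInfo.height / 2;
  objectList.push_back (obj);
  return true;
}

// Verifies the function signature at compile time.
CHECK_CUSTOM_PARSE_FUNC_PROTOTYPE (NvDsInferParseCustomDetection);
```

The compiled library and exported function name are typically wired into the detection config via `custom-lib-path` and `parse-bbox-func-name`.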
Pixel-wise classification producing semantic masks. Includes a custom colorizer for visualization with class labels.
Check the following repo: semantic_segmentation_dinov3
- Implementation: `src/probes/segmentation_probe.cpp`
- CUDA kernels: `src/utils_cuda/segmentation.cu`
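The colorizer itself is typically a one-thread-per-pixel CUDA kernel that maps each class ID to a color through a small lookup table. A minimal sketch along those lines (the palette and memory layout are assumptions, not this repo's actual kernel):

```cuda
// Illustrative class-ID -> RGBA colorizer (not the repo's exact kernel).
__constant__ uchar4 kPalette[256];  // per-class colors, uploaded by the host

__global__ void colorize_segmentation (const int *class_ids, uchar4 *rgba,
                                       int width, int height)
{
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= width || y >= height) return;
  int idx = y * width + x;
  rgba[idx] = kPalette[class_ids[idx] & 0xFF];  // clamp to table size
}
```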
Monocular depth estimation producing metric depth maps. Visualized as colored depth maps with configurable near/far range.
Check the following repo: depth_dinov3
- Implementation: `src/probes/depth_probe.cpp`
- CUDA kernels: `src/utils_cuda/depth.cu`
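Applying the near/far range usually amounts to clamping the metric depth, normalizing it to [0, 1], and mapping it through a colormap. A minimal sketch, with an assumed grayscale ramp in place of the repo's actual colormap:

```cuda
// Illustrative depth visualizer: clamp metric depth to [near, far],
// normalize, and write a grayscale ramp (brighter = closer).
__global__ void colorize_depth (const float *depth, uchar4 *rgba,
                                int n, float near_m, float far_m)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float d = fminf (fmaxf (depth[i], near_m), far_m);
  float t = (d - near_m) / (far_m - near_m);   // 0 = near, 1 = far
  unsigned char v = (unsigned char) (255.0f * (1.0f - t));
  rgba[i] = make_uchar4 (v, v, v, 255);
}
```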
Dense optical flow estimation between consecutive frames. Visualized as colored flow fields using HSV color encoding.
Check the following repo: optical_flow_dinov3
- Implementation: `src/probes/optical_flow_probe.cpp`
- CUDA kernels: `src/utils_cuda/optical_flow.cu`
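In the standard HSV encoding, flow direction becomes hue and flow magnitude becomes brightness. A hedged sketch of that conversion (the normalization constant `max_mag` is an assumption):

```cuda
// Illustrative HSV flow coloring: angle -> hue, magnitude -> value.
__global__ void colorize_flow (const float2 *flow, uchar4 *rgba,
                               int n, float max_mag)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float2 f = flow[i];
  float mag = sqrtf (f.x * f.x + f.y * f.y);
  float hue = (atan2f (f.y, f.x) + 3.14159265f) / (2.0f * 3.14159265f);
  float val = fminf (mag / max_mag, 1.0f);
  // Minimal HSV -> RGB conversion with full saturation
  float h6 = hue * 6.0f;
  int sector = (int) h6;
  float frac = h6 - sector;
  float q = val * (1.0f - frac), t = val * frac;
  float r, g, b;
  switch (sector % 6) {
    case 0:  r = val; g = t;   b = 0.0f; break;
    case 1:  r = q;   g = val; b = 0.0f; break;
    case 2:  r = 0.0f; g = val; b = t;   break;
    case 3:  r = 0.0f; g = q;   b = val; break;
    case 4:  r = t;   g = 0.0f; b = val; break;
    default: r = val; g = 0.0f; b = q;   break;
  }
  rgba[i] = make_uchar4 ((unsigned char)(r * 255.0f),
                         (unsigned char)(g * 255.0f),
                         (unsigned char)(b * 255.0f), 255);
}
```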
The application uses a GStreamer pipeline built with NVIDIA DeepStream components:
```
Source → nvstreammux → nvinfer (backbone) → tee
    ├→ nvinfer (depth)        → probe → sink
    ├→ nvinfer (detection)    → probe → sink
    ├→ nvinfer (segmentation) → probe → sink
    └→ nvinfer (optical_flow) → probe → sink
```
Key components:
- Source: `v4l2src`, `uridecodebin`, or `rtspsrc`, depending on the input type
- `nvstreammux`: Batches frames for inference (batch size = 1 by default)
- `nvinfer` (DINOv3 backbone): Runs once to extract shared features
- `tee`: Splits the feature stream to the multiple task heads
- `nvinfer` (task heads): Lightweight inference for each task
- Probes: Custom GStreamer probes for post-processing and visualization
- Sinks: Display outputs (separate windows or a tiled mosaic)
The pipeline builder (`src/pipeline/pipeline_builder.cpp`) dynamically constructs the pipeline based on the enabled tasks.
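As a sketch of that idea, adding one branch per enabled task might look like this (element choices such as `nveglglessink` and the helper name are assumptions, not the repo's actual code):

```cpp
#include <gst/gst.h>

// Simplified sketch: hang one task branch (queue -> nvinfer -> sink)
// off the tee and attach its post-processing probe.
static void add_task_branch (GstElement *pipeline, GstElement *tee,
                             const char *infer_config,
                             GstPadProbeCallback probe_cb)
{
  GstElement *queue = gst_element_factory_make ("queue", NULL);
  GstElement *infer = gst_element_factory_make ("nvinfer", NULL);
  GstElement *sink  = gst_element_factory_make ("nveglglessink", NULL);

  g_object_set (infer, "config-file-path", infer_config, NULL);

  gst_bin_add_many (GST_BIN (pipeline), queue, infer, sink, NULL);
  gst_element_link_many (tee, queue, infer, sink, NULL);

  // Post-processing/visualization runs in a pad probe on nvinfer's output.
  GstPad *src_pad = gst_element_get_static_pad (infer, "src");
  gst_pad_add_probe (src_pad, GST_PAD_PROBE_TYPE_BUFFER,
                     probe_cb, NULL, NULL);
  gst_object_unref (src_pad);
}
```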
| Feature | dinov3_deepstream | dinov3_ros |
|---|---|---|
| Framework | NVIDIA DeepStream + GStreamer | ROS 2 |
| Language | C++/CUDA | Python |
| Use Case | Production video analytics, edge devices | Research, robotics integration |
| Latency | Lower (hardware pipeline) | Higher (Python overhead) |
| Deployment | Standalone application | ROS 2 node ecosystem |
| Flexibility | Fixed pipeline | Topic-based composition |
Both projects share the same task-specific head models and DINOv3 backbone weights.
- Code in this repo: Apache-2.0
- DINOv3: Licensed separately by Meta (see DINOv3 LICENSE)
- NVIDIA DeepStream SDK: Closed-source SDK subject to NVIDIA's terms of use. See the NGC DeepStream collection for license details.
- We don't distribute DINOv3 weights. Follow upstream instructions to obtain them.
