
Project realized by Evan Galli, Jilian Lubrat, Eliot Menoret and Antoine-Marie Michelozzi
as part of the Autonomous Intelligent Systems and Edge computing & Embedded AI courses.
- Person recognition: The bot exclusively recognizes and obeys commands from its "master".
- Hand-gesture control: Use hand signals to pair with the bot and direct its actions.
- Person following: The bot actively tracks and follows the "master" until signaled to stop.
- Turtlebot3
- Luxonis OAK-D Pro 3D camera
Ensure that the directory path contains no spaces to avoid any issues during execution.
Install ROS 2 Jazzy and uv to run this project.
Then install the following packages:
```
sudo apt install ros-jazzy-depthai-ros ros-jazzy-turtlebot3-navigation2 ros-jazzy-turtlebot3-description python3-gz-transport13 python3-gz-msgs10
```

Set up the Luxonis OAK-D Pro camera by following the instructions here.
To use ROS Humble, you must update the following files:
- `Makefile`: set `PYTHONPATH` to `.venv/lib/python3.10/site-packages` in the various instructions.
- `pyproject.toml`: update the `requires-python` version to `==3.10.12`.
- `.python-version`: change the Python version to `3.10.12`.
- `src/turtlebot3_launch/launch/common_computer.launch.py`: change the `params_file` of `nav2_bringup` to `nav2_params_deployed.yaml`.
- `make sim` to start the project in simulation mode
- `make deploy` to start the project in a deployed environment
- `make real` to start the project on a computer connected to the Turtlebot3
- `make sim_teleop` to manually control the robot with keyboard inputs in the simulation
- `make teleop` to manually control the real robot with keyboard inputs
- `make sim_gestures` to change the person's image displayed in the simulation
- `make clean` to clean the workspace
- `make mass_shooting` to kill all remaining simulation processes
Do not run the simulation on a Raspberry Pi; use a standard PC instead for better performance.
The camera_reader is the package responsible for:
- Launching the camera and its associated AI model.
- Processing camera and model outputs directly on the device.
- Running the gesture detection model.
- Processing the gesture detection model’s output.
- Publishing the target point to a ROS topic.
- Publishing the detected gesture associated with the image.
This is the primary package for all image-related process management. Everything is integrated within the same package to avoid transmitting raw images over ROS topics, which would be too resource-intensive for the real-time performance that gesture detection inference requires.
Detailed Workflow:
- Initialization: The package's first role is to establish a connection with the camera and create the data pipeline. This pipeline includes:
  - An acquisition node for both color and depth images (using stereo cameras).
  - A neural network node running YOLOv11 Nano Segmentation to detect people in the frame.
- Data Processing: The pipeline outputs model data, which is processed by a set of utility functions and a yolo_api to determine bounding boxes and segmentation masks. The raw image, depth map, and 3D point cloud are also retrieved.
- Gesture Detection: A specific person is cropped from the output image based on their bounding box. This cropped image is then resized via letterboxing to maintain its aspect ratio and fed into a YOLOv10 model to identify the specific gesture being performed.
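The letterbox step above can be sketched as a small helper that computes the scale and padding for a crop. This is a pure-Python illustration of the technique; the function name and the 640×640 model input size are assumptions, not the package's actual API:

```python
def letterbox_params(src_w, src_h, dst=640):
    """Compute the scale and padding needed to fit a src_w x src_h crop
    into a dst x dst square while preserving its aspect ratio."""
    scale = min(dst / src_w, dst / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x = (dst - new_w) // 2   # left/right border width
    pad_y = (dst - new_h) // 2   # top/bottom border height
    return scale, new_w, new_h, pad_x, pad_y

# A tall 320x480 person crop fitted into a 640x640 model input:
# scaled to 427x640, with 106-pixel borders on the left and right.
scale, w, h, px, py = letterbox_params(320, 480)
```

Because the crop is scaled by a single factor and padded rather than stretched, the person's proportions (and therefore the hand gesture's shape) are preserved for the model.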
- Spatial Localization: Simultaneously, the depth image is coupled with the segmentation mask to calculate the distance to the person. This defines a point in camera space, which is then converted into map space coordinates via a coordinate transformation.
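Coupling the depth image with the segmentation mask amounts to aggregating depth values over the masked pixels. A minimal sketch, assuming a median aggregation (the actual statistic used by the package is not stated in the source):

```python
from statistics import median

def person_distance_mm(depth, mask):
    """Median depth (mm) over pixels selected by the segmentation mask.
    The median is robust to holes (zeros) and background outliers that
    leak into the mask; returns None if no valid pixel is selected."""
    vals = [d for depth_row, mask_row in zip(depth, mask)
              for d, m in zip(depth_row, mask_row) if m and d > 0]
    return median(vals) if vals else None

depth = [[0, 1500, 1520],
         [1490, 1510, 9000]]   # 9000 mm: background pixel inside the mask
mask  = [[0, 1, 1],
         [1, 1, 1]]
```

With a mean instead of a median, the single 9000 mm background pixel would pull the estimated distance far behind the person; the median ignores it.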
- ROS Outputs: Finally, the coordinates are sent to the `/robot/goal_point` ROS topic as a PointStamped message, and the gesture is sent to `/gesture/detected` as a String. These are then used by the orchestrator.
This package contains the same logic as the camera_reader package, re-implemented specifically for simulation environments. Since simulation does not involve physical hardware, the sections dedicated to camera connection and internal hardware pipeline creation are omitted.
Key Differences:
- Data Source: The YOLO model is loaded locally, and inference is performed directly on the image streams received via the `/rgb_camera/image` and `/depth_camera/image` ROS topics from the simulation.
- Pipeline: There is no pipeline sent to an external processor (such as an OAK-D); all processing is handled by the simulation computer's CPU/GPU.
The remaining functionality, including mask processing, gesture detection, and coordinate transformations, remains strictly identical to the physical version.
The pilot package is responsible for sending a PoseStamped message to Nav2, which calculates the path required to reach the destination. This PoseStamped is received via the `/pilot/goal_point` ROS topic.
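The pilot's core behavior (latch the most recent goal, then forward it to Nav2 on a fixed 1 Hz timer, per the diagram below) can be sketched without ROS. The class, its names, and the send-each-goal-once policy are illustrative assumptions, not the package's actual implementation:

```python
class PilotSketch:
    """Stores the latest goal and forwards it at most once per timer
    tick, mimicking the pilot's 1 Hz timer in front of Nav2."""
    def __init__(self):
        self.latest_goal = None
        self.sent = []            # stand-in for Nav2 action goal calls

    def on_goal(self, pose):      # callback for /pilot/goal_point
        self.latest_goal = pose   # a newer goal replaces an unsent one

    def on_timer(self):           # fired at 1 Hz while the node is active
        if self.latest_goal is not None:
            self.sent.append(self.latest_goal)
            self.latest_goal = None

pilot = PilotSketch()
pilot.on_goal((1.0, 2.0))
pilot.on_goal((1.5, 2.0))   # arrives before the next tick
pilot.on_timer()            # only the freshest goal reaches Nav2
```

Decoupling goal reception from goal submission this way keeps Nav2 from being flooded when the camera publishes targets faster than the planner can replan.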
The Screen Manager updates the virtual displays in the simulation, providing visual feedback for the currently selected gesture.
It uses gz-transport13 to communicate with the Gazebo world, allowing it to swap screens in real-time without restarting the simulation.
The manager toggles between three specific visual states:
| State | Image Displayed | Context |
|---|---|---|
| Normal | `video_screen_0` | No specific gesture detected. The robot is in "standby" or looking for its person. |
| Start Following | `video_screen_1` | The "Fist" gesture is recognized. The robot confirms it is now tracking the person. |
| Stop | `video_screen_2` | The "Stop" gesture is recognized. The robot confirms it has halted all movement. |
To change the image, the package "teleports" the desired screen to specific coordinates within the simulation's field of view, while hiding the others below the ground plane.
ROS 2 provides the concept of managed nodes, also called LifecycleNodes. Unlike regular nodes, a LifecycleNode follows a predefined state machine that allows its execution to be explicitly controlled at runtime.
A managed node transitions between steady states (unconfigured, inactive, active) using explicit transitions such as configure, activate, and deactivate.
This makes it possible to start or stop a node’s behavior deterministically, without restarting the node itself.
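The state machine described above can be summarized as a transition table over the primary states. This is a plain-Python sketch of the lifecycle semantics, not rclpy code; only the transitions relevant to this project are listed:

```python
# Primary lifecycle states and the explicit transitions between them.
TRANSITIONS = {
    ("unconfigured", "configure"): "inactive",
    ("inactive", "activate"): "active",
    ("active", "deactivate"): "inactive",
    ("inactive", "cleanup"): "unconfigured",
}

def apply(state, transition):
    """Return the next state, or raise if the transition is not legal
    from the current state (managed nodes reject such requests)."""
    try:
        return TRANSITIONS[(state, transition)]
    except KeyError:
        raise ValueError(f"cannot {transition} from {state}")

state = apply("unconfigured", "configure")  # configure at startup
state = apply(state, "activate")            # "fist" gesture
state = apply(state, "deactivate")          # "stop" gesture
```

Note that `activate` is only legal from `inactive`: the orchestrator must configure the Pilot once at startup before any gesture can start or stop navigation.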
In this project, the navigation Pilot node is implemented as a LifecycleNode and is controlled by the Orchestrator.
The Orchestrator centralizes all high-level decision logic:
- It listens to gesture commands via `/gesture/detected` from the Camera Reader node to control the Pilot lifecycle (`fist` to activate, `stop` to deactivate).
- It subscribes to `/robot/goal_point` (Camera Reader) and `/odom` (Gazebo) and preprocesses navigation goals before forwarding them to the Pilot.
- It applies goal update filtering, ensuring that a new target is sufficiently different from the previously forwarded one before it is sent to the Pilot.
- It performs the conversion and validation from `PointStamped` to `PoseStamped`, forwarding only preprocessed navigation goals to the Pilot.
Only when the Pilot is active and the new goal is considered valid does the orchestrator forward a preprocessed PoseStamped message to /pilot/goal_point.
This design keeps the Pilot node simple and focused on navigation execution, while the orchestrator handles gesture control, safety checks, and goal validation.
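The orchestrator's goal filtering and point-to-pose conversion can be sketched as two plain functions. The 0.3 m threshold and the identity orientation are assumptions for illustration; the source does not specify either value:

```python
import math

MIN_GOAL_SHIFT = 0.3  # metres; assumed "sufficiently different" threshold

def goal_changed(prev, new, threshold=MIN_GOAL_SHIFT):
    """True if the new (x, y) target is far enough from the last
    forwarded one to justify sending a fresh goal to the Pilot."""
    if prev is None:
        return True               # first goal always passes
    return math.dist(prev, new) >= threshold

def point_to_pose(x, y, z):
    """Build a PoseStamped-like payload from a goal point, keeping an
    identity orientation (the final heading is left to Nav2)."""
    return {"position": (x, y, z),
            "orientation": (0.0, 0.0, 0.0, 1.0)}  # unit quaternion

assert goal_changed(None, (1.0, 2.0))
assert not goal_changed((1.0, 2.0), (1.1, 2.0))  # 0.1 m jitter is ignored
```

Filtering out near-duplicate targets like this prevents the Pilot from cancelling and resubmitting Nav2 goals every frame while the tracked person is standing still.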
```mermaid
graph TB
    subgraph Camera["Camera Reader Node"]
        C1["/gesture/detected"]
        C2["/robot/goal_point"]
    end
    subgraph Gazebo["Gazebo Simulation"]
        G1["/odom"]
    end
    subgraph Orchestrator["Orchestrator Node"]
        O1[Listen /gesture/detected]
        O2[Listen /robot/goal_point]
        O3[Listen /odom]
        O4[Process distance check]
        O5[Convert Point to Pose]
        O6[Publish /pilot/goal_point]
        O7[Manage Pilot lifecycle]
        O1 --> |fist| O7
        O1 --> |stop| O7
        O2 --> O4
        O3 --> O4
        O4 --> O5
        O5 --> O6
    end
    subgraph Pilot["Pilot Node (Lifecycle)"]
        direction TB
        P_START([Start])
        P_UNCONF[UNCONFIGURED]
        P_INACT[INACTIVE]
        P_ACT[ACTIVE]
        P_START --> P_UNCONF
        P_UNCONF --> |configure| P_INACT
        P_INACT --> |activate| P_ACT
        P_ACT --> |deactivate| P_INACT
        P_SUB[Listen /pilot/goal_point]
        P_TIMER[Timer 1Hz]
        P_NAV2[Send to Nav2]
        P_CANCEL[Cancel Nav2 goal]
        P_ACT --> P_SUB
        P_SUB --> P_TIMER
        P_TIMER --> P_NAV2
        P_ACT -.-> |on deactivate| P_CANCEL
    end
    C1 --> O1
    C2 --> O2
    G1 --> O3
    O6 --> P_SUB
    O7 --> |configure at start| P_UNCONF
    O7 --> |activate on fist| P_INACT
    O7 --> |deactivate on stop| P_ACT
    style Camera fill:#cfe2f3
    style Gazebo fill:#fff2cc
    style Orchestrator fill:#d9ead3
    style Pilot fill:#fce5cd
```