🚀 Dropship: Autonomous Rocket Landing via Reinforcement Learning

S8 CSE Honours Mini Project

An interactive, continuous-control Reinforcement Learning project focused on teaching an AI agent to execute a complex "hoverslam" (suicide burn) rocket landing. Built from scratch using a custom Box2D physics environment, trained with Soft Actor-Critic (SAC), and deployed to a live Flask web dashboard with real-time telemetry streaming.

🎥 Demonstration

Algorithm Comparison: PPO vs. SAC

During the initial training phases, Proximal Policy Optimization (PPO) struggled with the three-dimensional continuous action space, often converging on suboptimal local optima or failing to arrest the rocket's descent velocity before touchdown.

Above: An early PPO agent failing to balance the thrust-to-weight ratio, resulting in a crash. Early SAC runs looked much the same.

By migrating to the entropy-maximizing Soft Actor-Critic (SAC) algorithm, the agent successfully learned the delicate multi-axis thrust balance required to execute a precise propulsive landing.

Above: The fully trained SAC agent executing a flawless "hoverslam" maneuver to land safely on the pad.

Interactive Web Dashboard

The trained neural policy is deployed in a Flask web application. It features a low-latency Server-Sent Events (SSE) telemetry pipeline and an MJPEG video stream, allowing users to randomize drop conditions and watch the AI adapt in real time.

🧠 The Environment & AI Architecture

The environment was built from the ground up using Gymnasium and Box2D to simulate realistic rigid-body physics, aerodynamic drag, and gravity.

Observation Space (12 Dimensions)

The AI receives a continuous stream of telemetry data to make decisions:

1. Horizontal Position (Normalized)
2. Vertical Position (Normalized)
3. Horizontal Velocity
4. Vertical Velocity
5. Angle (Radians)
6. Angular Velocity
7. Remaining Fuel (%)
8. Exact Thrust-to-Weight Ratio
9. The "Speedometer": The delta between current velocity and the ideal mathematical glide slope.
10-12. Action Memory: The AI's previous frame actions for smooth continuous control.

Action Space (3 Dimensions - Continuous)

Instead of discrete keyboard presses, the AI controls three continuous sliders, each in the range [-1.0, 1.0]:

Main Engine: Raw vertical thrust.
Center Thrusters: Pure horizontal translation (sliding).
Nose Thrusters: Pure torque (gimbaling/tilting).
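
As a rough illustration of how the 12-dimensional observation could be assembled from the nine telemetry readings plus the three-element action memory, here is a minimal sketch using only the standard library. The names `Telemetry`, `clip_action`, and `build_observation` are hypothetical and are not taken from the project's code; the field order simply follows the observation list above.

```python
from dataclasses import astuple, dataclass
from typing import List, Sequence


@dataclass
class Telemetry:
    """The nine physical readings of the observation (field names are illustrative)."""
    x: float                  # horizontal position, normalized
    y: float                  # vertical position, normalized
    vx: float                 # horizontal velocity
    vy: float                 # vertical velocity
    angle: float              # radians
    angular_velocity: float
    fuel_fraction: float      # remaining fuel
    thrust_to_weight: float
    glide_slope_error: float  # the "Speedometer" delta


def clip_action(action: Sequence[float]) -> List[float]:
    """Clamp the three continuous controls into [-1.0, 1.0]."""
    if len(action) != 3:
        raise ValueError("expected 3 action components")
    return [max(-1.0, min(1.0, float(a))) for a in action]


def build_observation(t: Telemetry, prev_action: Sequence[float]) -> List[float]:
    """Concatenate 9 telemetry values with the previous frame's 3 actions."""
    return list(astuple(t)) + clip_action(prev_action)


obs = build_observation(
    Telemetry(0.1, 0.8, 0.0, -3.0, 0.02, 0.0, 0.9, 1.4, -0.5),
    prev_action=[0.5, -2.0, 0.0],  # out-of-range value is clamped
)
print(len(obs), obs[9:])
```

Feeding the previous action back into the observation is what lets the policy see how hard it was already pushing each control, which is the "action memory" idea described in the list.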

🧗‍♂️ The Engineering Journey & Challenges

Training an agent to land a rocket is notoriously difficult due to the sparse reward problem (you only know if you succeeded at the very end). Here is a breakdown of the development phases and the hurdles overcome:

1. The Algorithm Shift (PPO vs. SAC)

Initial experiments began using Proximal Policy Optimization (PPO). While highly stable, PPO struggled heavily with the complex, continuous 3D action space of the rocket's thrusters. The transition to Soft Actor-Critic (SAC) was the turning point. SAC's entropy maximization encouraged the agent to explore much more aggressively, finding the delicate balance between the main engine and side thrusters significantly faster.

2. Reward Shaping & Adversarial Exploits

Reinforcement learning agents are infamous for finding loopholes. Because they act as pure mathematical optimizers, even a slightly unbalanced reward function lets the agent exploit the physics to maximize its score without fulfilling the actual engineering objective.

(Note: While building the custom Box2D physics engine, scaling the observation space, and tuning the SAC hyperparameters presented countless daily challenges, the following three adversarial exploits were by far the major engineering hurdles overcome during training.)

Exploit A: The "Physics Death Trap" (Infinite Hovering)

The Problem: The environment applied a strict crash penalty (-250 points) if the rocket touched down with a vertical velocity beyond -5.0 m/s (that is, descending faster than 5.0 m/s). Simultaneously, the agent was rewarded for following an aggressive mathematical glide slope curve.
The Exploit: At 2 meters above the pad, the aggressive glide slope demanded a falling speed of -7.1 m/s. The AI realized its thrust-to-weight ratio was too low to brake from -7.1 to -5.0 m/s in just 2 meters. Knowing that following the glide slope guaranteed a fatal crash, the AI learned to permanently hover at 2 meters to avoid touching the ground at all.
The Fix: The target glide slope multiplier was softened, giving the AI a survivable target of -3.8 m/s on final approach. Furthermore, a steep "Anti-Climb" velocity penalty was introduced, mathematically forcing the AI to descend and commit to the landing.
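
To make the arithmetic behind this trap concrete, the sketch below applies the constant-acceleration relation v_end^2 = v_start^2 - 2 * a * d to the numbers quoted above. The function names are hypothetical, the gravity value of 9.81 m/s^2 is an assumption (the environment's actual gravity is not stated here), and drag is ignored.

```python
def required_deceleration(v_start: float, v_end: float, distance: float) -> float:
    """Constant net deceleration (m/s^2) needed to slow from v_start to v_end.

    Speeds are magnitudes in m/s; distance is the braking distance in meters.
    Derived from v_end^2 = v_start^2 - 2 * a * distance.
    """
    if distance <= 0:
        raise ValueError("distance must be positive")
    return (v_start ** 2 - v_end ** 2) / (2.0 * distance)


def required_thrust_to_weight(deceleration: float, gravity: float = 9.81) -> float:
    """Thrust-to-weight ratio that yields the given net upward acceleration.

    Net acceleration = thrust / mass - gravity, so TWR = 1 + deceleration / gravity.
    gravity=9.81 is an assumed value, not one read from the environment.
    """
    return 1.0 + deceleration / gravity


a = required_deceleration(7.1, 5.0, 2.0)
twr = required_thrust_to_weight(a)
print(f"deceleration: {a:.2f} m/s^2, thrust-to-weight: {twr:.2f}")
```

Under these assumptions, braking from 7.1 m/s to 5.0 m/s within 2 meters needs a net deceleration of roughly 6.35 m/s^2. A rocket whose engine cannot deliver that on top of holding against gravity has no safe way to follow the original curve to the ground, which is consistent with the hovering behaviour described above.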

Exploit B: The "Cheap Parking Ticket" (X-Axis Exploit)

The Problem: The continuous penalty for drifting horizontally away from the landing pad was capped at a maximum of -1.0 point per frame to prevent gradient explosion early in training.
The Exploit: The AI realized that utilizing its side thrusters to align with the pad carried a high risk of tipping the rocket over, which would trigger a massive -500 out-of-bounds penalty. Instead, it chose to safely land in the dirt 20 meters away from the pad, happily paying the tiny -1.0 "parking ticket" because it was mathematically cheaper than risking a flip.
The Fix: A "Dead-Center Bonus" was introduced. If the rocket maneuvers its center of mass within 0.1 meters of the exact middle of the concrete pad, it receives a steady stream of bonus points that scale exponentially when the rocket is under 5 meters in altitude. The opportunity cost of missing the pad became too high to ignore.
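
The interaction between the capped drift penalty and the bonus can be sketched as follows. This is only an illustrative shape: the function names, the linear penalty slope, the base bonus, and the exponential rate are all assumptions, since the project's exact formulas are not given here. Only the -1.0 cap, the 0.1 m radius, and the 5 m altitude threshold come from the description above.

```python
import math


def drift_penalty(x_offset_m: float, slope: float = 0.05, cap: float = 1.0) -> float:
    """Per-frame penalty for horizontal drift, capped at -cap (slope is assumed)."""
    return -min(cap, slope * abs(x_offset_m))


def dead_center_bonus(
    x_offset_m: float,
    altitude_m: float,
    base: float = 1.0,       # assumed base bonus per frame
    radius_m: float = 0.1,   # from the text: within 0.1 m of pad center
    ceiling_m: float = 5.0,  # from the text: scaling kicks in under 5 m
    rate: float = 0.5,       # assumed exponential growth rate
) -> float:
    """Per-frame bonus for holding the pad center, growing as altitude drops."""
    if abs(x_offset_m) > radius_m:
        return 0.0
    if altitude_m >= ceiling_m:
        return base
    return base * math.exp(rate * (ceiling_m - max(altitude_m, 0.0)))


# Landing 20 m away costs the same as landing 100 m away once the cap is hit.
print(drift_penalty(20.0), drift_penalty(100.0))
# On-center, the bonus grows as the rocket descends below 5 m.
print(dead_center_bonus(0.0, 4.0), dead_center_bonus(0.0, 1.0))
```

The point of the sketch is the asymmetry: the capped penalty stops distinguishing between a near miss and a far miss, while the bonus keeps growing for an on-center approach, so staying off the pad forfeits an increasing amount of reward.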

Exploit C: The "Ghost Brake" Defense

The Problem: To apply the crash penalty, the environment checked the rocket's internal Y-velocity array at the exact frame the episode terminated.
The Exploit: The AI learned to free-fall at lethal speeds to save fuel. However, in the exact millisecond it collided with the ground, it would fire its main engine at 100%. This manipulated its internal velocity array to briefly report a "safe" speed of -4.9 m/s to the reward function, spoofing a safe landing even though the physical momentum would have shattered the rocket.
The Fix: The Python environment was patched to bypass the agent's reported velocity. Instead, the raw physical impact force was extracted directly from the internal Box2D ContactDetector collision listener, ensuring the -250 point crash penalty could not be tricked by last-millisecond engine burns.
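
One way to picture this fix is a small recorder that accumulates contact impulses instead of trusting the body's velocity on the terminal frame. The class below is a hypothetical, dependency-free sketch rather than the project's code: in pybox2d, `record()` would presumably be called from a contact listener's `PostSolve` callback with the solver's normal impulses, and the conversion from impulse to impact speed (impulse divided by mass) is a simplifying assumption.

```python
from typing import Iterable


class ImpactRecorder:
    """Tracks the largest contact impulse seen during an episode (illustrative)."""

    def __init__(self, body_mass: float, max_safe_speed: float = 5.0) -> None:
        if body_mass <= 0:
            raise ValueError("body_mass must be positive")
        self.body_mass = body_mass
        self.max_safe_speed = max_safe_speed  # 5.0 m/s threshold from the text
        self.peak_impulse = 0.0

    def record(self, normal_impulses: Iterable[float]) -> None:
        """Store the largest summed normal impulse reported for a contact."""
        total = sum(normal_impulses)
        self.peak_impulse = max(self.peak_impulse, total)

    @property
    def impact_speed(self) -> float:
        """Velocity change implied by the peak impulse (impulse / mass)."""
        return self.peak_impulse / self.body_mass

    def crashed(self) -> bool:
        """True if the implied impact speed exceeds the safe threshold."""
        return self.impact_speed > self.max_safe_speed

    def reset(self) -> None:
        """Clear the recorded peak at the start of a new episode."""
        self.peak_impulse = 0.0


recorder = ImpactRecorder(body_mass=10.0)
recorder.record([30.0, 40.0])  # two contact points in one solver step
print(recorder.impact_speed, recorder.crashed())
```

Because the impulse reflects the momentum the ground actually had to absorb, a last-frame engine burn that alters the reported velocity does not change what the recorder sees.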

3. Domain Randomization & "Death Spawns"

To ensure the AI was robust, the starting positions were randomized (X-offset, altitude, angle, and starting drop speed). However, early iterations spawned the rocket in mathematically impossible scenarios (e.g., 30 meters off-center, upside down, falling at 100 m/s). The randomization was carefully calibrated into a discrete array grid to ensure every drop was physically solvable.
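
A discrete spawn grid of this kind might look like the sketch below. Every value in it is a placeholder chosen for illustration, not one of the project's calibrated settings, and the names `SPAWN_GRID` and `sample_spawn` are hypothetical.

```python
import itertools
import random

# Placeholder values; the project's calibrated grid is not reproduced here.
X_OFFSETS_M = [-10.0, -5.0, 0.0, 5.0, 10.0]
ALTITUDES_M = [40.0, 60.0, 80.0]
ANGLES_RAD = [-0.2, 0.0, 0.2]
DROP_SPEEDS_MS = [0.0, -5.0, -10.0]

# Every combination is enumerated up front, so each one can be checked
# for solvability once instead of sampling unbounded continuous ranges.
SPAWN_GRID = [
    {"x_offset": x, "altitude": h, "angle": a, "vy": v}
    for x, h, a, v in itertools.product(
        X_OFFSETS_M, ALTITUDES_M, ANGLES_RAD, DROP_SPEEDS_MS
    )
]


def sample_spawn(rng: random.Random) -> dict:
    """Pick one pre-vetted spawn configuration uniformly at random."""
    return dict(rng.choice(SPAWN_GRID))


rng = random.Random(0)
spawn = sample_spawn(rng)
print(len(SPAWN_GRID), spawn)
```

A dictionary like `spawn` is the kind of object that could be handed to a Gymnasium environment through `reset(options=...)`, assuming the environment reads those keys.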

4. Final Training Metrics (15.8 Million Steps)

After applying these reward shaping techniques and exploit mitigations, the SAC agent was forced to abandon suboptimal strategies and converge on the optimal "hoverslam" policy.

Episodic Reward: The moving average (orange line) shows the agent recovering from early exploration penalties and converging toward stable, positive touchdown jackpots.

Episode Length: The stabilization of the moving average (red line) indicates consistent landing times, suggesting the agent overcame the "infinite hovering" physics death trap.

🌐 The Web Deployment (Interactive Dashboard)

To test the model in real time, the custom Pygame environment was hooked up to a full-stack Flask web application.

Key Technical Features:

Custom Domain Testing: Users can select exact spawn coordinates via HTML sliders, which are injected directly into the Gym environment's reset(options=...) dictionary.
MJPEG Video Streaming: The server captures the off-screen Pygame rgb_array, compresses each frame to JPEG with chroma subsampling disabled (subsampling=0) to keep on-screen text sharp, and streams it to the frontend as an HTTP multipart/x-mixed-replace response at 60 FPS.
Live Telemetry via SSE: To prevent network I/O bottlenecks from lagging the physics engine, traditional HTTP polling was scrapped. The dashboard utilizes Server-Sent Events (SSE) to maintain an open, one-way pipeline, streaming live altitude, velocity, and thrust percentages to the UI directly from the Box2D physics engine.
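
For readers unfamiliar with the two wire formats involved, the sketch below shows how a single SSE message and a single MJPEG part are framed. The helper names, the `telemetry` event name, and the `frame` boundary string are assumptions for illustration; the project's actual route handlers are not shown here.

```python
import json
from typing import Mapping


def format_sse(payload: Mapping[str, float], event: str = "telemetry") -> str:
    """Frame one Server-Sent Events message: event line, data line, blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def format_mjpeg_part(jpeg_bytes: bytes, boundary: str = "frame") -> bytes:
    """Frame one JPEG image as a part of a multipart/x-mixed-replace stream."""
    header = (
        f"--{boundary}\r\n"
        "Content-Type: image/jpeg\r\n"
        f"Content-Length: {len(jpeg_bytes)}\r\n\r\n"
    ).encode("ascii")
    return header + jpeg_bytes + b"\r\n"


print(format_sse({"altitude": 12.5, "vy": -3.2}))
print(format_mjpeg_part(b"\xff\xd8fake-jpeg\xff\xd9")[:40])
```

In a Flask app, generators yielding these chunks would typically be wrapped in a `Response` with the `text/event-stream` mimetype for telemetry and `multipart/x-mixed-replace; boundary=frame` for video, so the browser keeps one long-lived connection open for each stream.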

⚙️ Installation & Usage

Prerequisites

Python 3.10+
Box2D (SWIG required for Windows)

Setup

# 1. Clone the repository
git clone https://github.com/MathewJobey/S8-Honours-Reinforcement_Learning-Mini_Project.git

# 2. Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows use: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

Launching the Web Dashboard

To fly the fully trained sac_phase3_final_v0 model interactively:

python app.py

Open http://127.0.0.1:5000 in your web browser.

Running a Local Test Flight

To watch the AI perform automated test flights locally via Pygame:

python test_phase3.py

📄 License

This project is licensed under the MIT License.
