S8 CSE Honours Mini Project
An interactive, continuous-control Reinforcement Learning project focused on teaching an AI agent to execute a complex "hoverslam" (suicide burn) rocket landing. Built from scratch using a custom Box2D physics environment, trained with Soft Actor-Critic (SAC), and deployed to a live Flask web dashboard with real-time telemetry streaming.
During the initial training phases, Proximal Policy Optimization (PPO) struggled with the chaotic 3D continuous action space, often converging on suboptimal local minima or failing to arrest terminal velocity.
Above: An early PPO agent failing to balance the thrust-to-weight ratio, resulting in a crash. Early SAC runs looked much the same.
By migrating to the entropy-maximizing Soft Actor-Critic (SAC) algorithm, the agent successfully learned the delicate multi-axis thrust balance required to execute a precise propulsive landing.
Above: The fully trained SAC agent executing a flawless "hoverslam" maneuver to land safely on the pad.
The trained neural policy is deployed into a Flask web application. It features a low-latency Server-Sent Events (SSE) telemetry pipeline and an MJPEG video stream, allowing users to dynamically randomize drop conditions and watch the AI adapt in real time.
The environment was built from the ground up using Gymnasium and Box2D to simulate realistic rigid-body physics, aerodynamic drag, and gravity.
The AI receives a continuous stream of telemetry data to make decisions:
- Horizontal Position (Normalized)
- Vertical Position (Normalized)
- Horizontal Velocity
- Vertical Velocity
- Angle (Radians)
- Angular Velocity
- Remaining Fuel (%)
- Exact Thrust-to-Weight Ratio
- The "Speedometer": The delta between current velocity and the ideal mathematical glide slope.
- Action Memory (observations 10-12): The AI's previous-frame actions, for smooth continuous control.
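As a rough illustration of how the twelve telemetry values above might be packed into one observation vector, here is a minimal sketch. The `RocketState` class, its field names, and the `target_vy` argument are hypothetical; the project's actual environment code may order or scale these values differently.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RocketState:
    # Hypothetical field names mirroring the telemetry list above.
    x_norm: float            # horizontal position (normalized)
    y_norm: float            # vertical position (normalized)
    vx: float                # horizontal velocity
    vy: float                # vertical velocity
    angle: float             # radians
    angular_velocity: float
    fuel_pct: float          # remaining fuel (%)
    twr: float               # thrust-to-weight ratio


def build_observation(state: RocketState, target_vy: float,
                      prev_action: Tuple[float, float, float]) -> List[float]:
    """Pack the state into a 12-element observation vector."""
    # "Speedometer": how far the current descent rate is from the glide slope.
    speedometer = state.vy - target_vy
    return [
        state.x_norm, state.y_norm, state.vx, state.vy,
        state.angle, state.angular_velocity, state.fuel_pct, state.twr,
        speedometer,
        *prev_action,  # entries 10-12: the previous frame's three actions
    ]
```

Feeding the previous action back in lets the policy see how hard it was already pushing each thruster, which is one common way to discourage jittery control.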
Instead of discrete keyboard presses, the AI controls three continuous sliders [-1.0, 1.0]:
- Main Engine: Raw vertical thrust.
- Center Thrusters: Pure horizontal translation (sliding).
- Nose Thrusters: Pure torque (gimbaling/tilting).
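The three sliders can be pictured as a small mapping from raw policy output to thruster commands. This is only a sketch: the function names are hypothetical, and rescaling the main engine from [-1.0, 1.0] to a 0..1 throttle is an assumption (an engine that cannot push downward), not something the project states.

```python
from typing import Sequence, Tuple


def clamp(value: float, low: float = -1.0, high: float = 1.0) -> float:
    """Limit a value to the closed interval [low, high]."""
    return max(low, min(high, value))


def map_action(action: Sequence[float]) -> Tuple[float, float, float]:
    """Clip a raw 3-element action and convert it to thruster commands."""
    main, center, nose = (clamp(a) for a in action)
    # Assumption: the main engine only thrusts upward, so -1..1 becomes 0..1.
    throttle = (main + 1.0) / 2.0
    # Center and nose thrusters keep their sign to encode direction.
    return throttle, center, nose
```

For example, `map_action([0.0, 0.0, 0.0])` yields half throttle with both side thrusters idle under this assumed mapping.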
Training an agent to land a rocket is notoriously difficult due to the sparse reward problem (you only know if you succeeded at the very end). Here is a breakdown of the development phases and the hurdles overcome:
Initial experiments began using Proximal Policy Optimization (PPO). While highly stable, PPO struggled heavily with the complex, continuous 3D action space of the rocket's thrusters. The transition to Soft Actor-Critic (SAC) was the turning point. SAC's entropy maximization encouraged the agent to explore much more aggressively, finding the delicate balance between the main engine and side thrusters significantly faster.
Reinforcement learning agents are infamous for finding loopholes. Because they act as pure mathematical optimizers, if a reward function is even slightly unbalanced, the agent will discover physical loopholes to maximize its score without fulfilling the actual engineering objective.
(Note: While building the custom Box2D physics engine, scaling the observation space, and tuning the SAC hyperparameters presented countless daily challenges, the following three adversarial exploits were by far the major engineering hurdles overcome during training.)
- The Problem: The environment featured a strict crash penalty (`-250` points) if the rocket touched down faster than `-5.0 m/s`. Simultaneously, it was rewarded for following an aggressive mathematical glide slope curve.
- The Exploit: At 2 meters above the pad, the aggressive glide slope demanded a falling speed of `-7.1 m/s`. The AI realized its thrust-to-weight ratio wasn't powerful enough to brake from `-7.1` to `-5.0` in just 2 meters. Knowing that following the guide curve guaranteed a fatal crash, the AI learned to permanently hover at 2 meters to avoid touching the ground at all.
- The Fix: The target glide slope multiplier was softened, giving the AI a survivable target of `-3.8 m/s` on final approach. Furthermore, a steep "Anti-Climb" velocity penalty was introduced, mathematically forcing the AI to descend and commit to the landing.
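The braking argument behind this exploit can be checked with constant-deceleration kinematics, v_end² = v_start² - 2·a·d. The sketch below is an illustration only: the function names are hypothetical, drag is ignored, thrust is assumed vertical, and the gravity value of 9.81 m/s² is an assumption because the environment's gravity setting is not stated.

```python
def required_deceleration(v_start: float, v_end: float, distance: float) -> float:
    """Constant deceleration (m/s^2) needed to slow from v_start to v_end
    (both given as speeds, in m/s) over the given distance (m)."""
    return (v_start ** 2 - v_end ** 2) / (2.0 * distance)


def required_twr(deceleration: float, gravity: float = 9.81) -> float:
    """Thrust-to-weight ratio needed to achieve a net upward acceleration
    equal to `deceleration`. Gravity defaults to an assumed 9.81 m/s^2."""
    return 1.0 + deceleration / gravity


# Original glide slope: brake from 7.1 m/s to 5.0 m/s within 2 m.
old_target = required_deceleration(7.1, 5.0, 2.0)

# Softened glide slope: 3.8 m/s is already below the 5.0 m/s crash limit,
# so the result is negative, meaning no extra braking is required.
new_target = required_deceleration(3.8, 5.0, 2.0)
```

Under these assumptions the original slope demands about 6.35 m/s² of net deceleration, or a thrust-to-weight ratio of roughly 1.65, while the softened target needs no last-moment braking at all.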

- The Problem: The continuous penalty for drifting horizontally away from the landing pad was capped at a maximum of `-1.0` point per frame to prevent gradient explosion early in training.
- The Exploit: The AI realized that utilizing its side thrusters to align with the pad carried a high risk of tipping the rocket over, which would trigger a massive `-500` out-of-bounds penalty. Instead, it chose to safely land in the dirt 20 meters away from the pad, happily paying the tiny `-1.0` "parking ticket" because it was mathematically cheaper than risking a flip.
- The Fix: A "Dead-Center Bonus" was introduced. If the rocket maneuvers its center of mass within 0.1 meters of the exact middle of the concrete pad, it receives a steady stream of bonus points that scale exponentially when the rocket is under 5 meters in altitude. The opportunity cost of missing the pad became too high to ignore.
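To make the incentive shift concrete, here is a minimal sketch of a capped drift penalty next to a dead-center bonus. Only the `-1.0` cap, the 0.1 m radius, and the 5 m altitude threshold come from the description above; the function names, the `scale`, `base`, and `growth` constants, and the exact exponential form are assumptions chosen for illustration.

```python
import math


def drift_penalty(x_offset_m: float, scale: float = 0.1, cap: float = 1.0) -> float:
    """Per-frame penalty for horizontal drift, never worse than -cap."""
    return -min(cap, scale * abs(x_offset_m))


def dead_center_bonus(x_offset_m: float, altitude_m: float,
                      radius: float = 0.1, base: float = 1.0,
                      low_altitude: float = 5.0, growth: float = 0.5) -> float:
    """Per-frame bonus for staying within `radius` of the pad center.

    Above `low_altitude` the bonus is flat; below it, the bonus grows
    exponentially as the rocket descends (assumed form).
    """
    if abs(x_offset_m) > radius:
        return 0.0
    if altitude_m >= low_altitude:
        return base
    return base * math.exp(growth * (low_altitude - altitude_m))
```

With the cap in place, landing 20 m or 100 m off the pad costs the same `-1.0` per frame, which is why the penalty alone could not pull the agent back. The bonus adds a reward the off-pad strategy can never collect.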

- The Problem: To apply the crash penalty, the environment checked the rocket's internal Y-velocity array at the exact frame the episode terminated.
- The Exploit: The AI learned to free-fall at lethal speeds to save fuel. However, in the exact millisecond it collided with the ground, it would fire its main engine at 100%. This manipulated its internal velocity array to briefly report a "safe" speed of `-4.9 m/s` to the reward function, spoofing a safe landing even though the physical momentum would have shattered the rocket.
- The Fix: The Python environment was patched to bypass the agent's reported velocity. Instead, the raw physical impact force was extracted directly from the internal Box2D `ContactDetector` collision listener, ensuring the `-250` point crash penalty could not be tricked by last-millisecond engine burns.
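The idea behind the fix can be sketched without Box2D itself: record the largest impact seen by the collision listener, and base the crash decision on that peak rather than on the velocity at the terminating frame. Everything below is a plain-Python stand-in; `ImpactRecorder`, `on_contact`, the impulse threshold, and the success reward value are hypothetical, and the project's real `ContactDetector` may work differently.

```python
class ImpactRecorder:
    """Stand-in for a collision listener that latches the peak contact impulse."""

    def __init__(self) -> None:
        self.peak_impulse = 0.0

    def on_contact(self, normal_impulse: float) -> None:
        # Hypothetical hook: called once per solved contact during a physics step.
        # Keeping the maximum means a later, gentler reading cannot overwrite
        # the hard first impact.
        self.peak_impulse = max(self.peak_impulse, normal_impulse)


def landing_reward(recorder: ImpactRecorder, max_safe_impulse: float,
                   crash_penalty: float = -250.0,
                   success_reward: float = 100.0) -> float:
    """Score the touchdown from the recorded peak impact, not the final velocity."""
    if recorder.peak_impulse > max_safe_impulse:
        return crash_penalty
    return success_reward
```

Because the peak is latched at the moment of contact, firing the engine afterwards changes the reported velocity but not the recorded impact.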
To ensure the AI was robust, the starting positions were randomized (X-offset, altitude, angle, and starting drop speed). However, early iterations spawned the rocket in mathematically impossible scenarios (e.g., 30 meters off-center, upside down, falling at 100 m/s). The randomization was carefully calibrated into a discrete array grid to ensure every drop was physically solvable.
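One way to picture a "discrete array grid" of spawn conditions is a Cartesian product of hand-picked values, sampled uniformly. The grid values and names below are hypothetical placeholders, not the project's calibrated numbers.

```python
import itertools
import random

# Hypothetical grid values; the project's calibrated ranges are not published here.
X_OFFSETS_M = (-10.0, -5.0, 0.0, 5.0, 10.0)
ALTITUDES_M = (40.0, 60.0, 80.0)
ANGLES_RAD = (-0.2, 0.0, 0.2)
DROP_SPEEDS = (0.0, -5.0, -10.0)

# Every combination in the grid is assumed to have been checked as solvable.
SPAWN_GRID = list(itertools.product(X_OFFSETS_M, ALTITUDES_M, ANGLES_RAD, DROP_SPEEDS))


def sample_spawn(rng: random.Random) -> dict:
    """Pick one starting condition uniformly from the grid."""
    x, altitude, angle, vy = rng.choice(SPAWN_GRID)
    return {"x_offset": x, "altitude": altitude, "angle": angle, "vy": vy}
```

A dictionary like this could then be passed to the environment through `reset(options=...)`. Restricting sampling to a fixed grid trades some diversity for the guarantee that no episode starts in an unwinnable state.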
After applying these reward shaping techniques and exploit mitigations, the SAC agent was forced to abandon suboptimal strategies and converge on the optimal "hoverslam" policy.
Episodic Reward
The moving average (orange line) demonstrates the agent recovering from early exploration penalties and converging toward stable, positive touchdown jackpots.
Episode Length
The stabilization of the moving average (red line) indicates consistent landing times, suggesting the agent overcame the "infinite hovering" physics death trap.
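For reference, a trailing moving average like the ones plotted can be computed as in the sketch below. The window size used for the project's plots is not stated, so `window` is left as a parameter, and the function name is hypothetical.

```python
from collections import deque
from typing import Iterable, List


def moving_average(values: Iterable[float], window: int) -> List[float]:
    """Trailing moving average; early entries average the samples seen so far."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]  # the oldest sample is about to be evicted
        buf.append(v)
        total += v
        out.append(total / len(buf))
    return out
```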
To test the model in real-time, the custom Pygame environment was hooked up to a full-stack Flask web application.
Key Technical Features:
- Custom Domain Testing: Users can select exact spawn coordinates via HTML sliders, which are injected directly into the Gym environment's `reset(options=...)` dictionary.
- MJPEG Video Streaming: The server intercepts the off-screen Pygame `rgb_array`, compresses it through a JPEG pipeline with `subsampling=0` (chroma subsampling disabled) so on-screen text stays sharp, and streams it to the frontend as an HTTP `multipart/x-mixed-replace` response at 60 FPS.
- Live Telemetry via SSE: To prevent network I/O bottlenecks from lagging the physics engine, traditional HTTP polling was scrapped. The dashboard utilizes Server-Sent Events (SSE) to maintain an open, one-way pipeline, streaming live altitude, velocity, and thrust percentages to the UI directly from the Box2D physics engine.
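The two streaming formats above are simple to frame by hand. The sketch below shows the wire format of one SSE message and one MJPEG part; the helper names, the payload keys, and the `frame` boundary string are hypothetical, and the project's `app.py` may build these differently.

```python
import json

BOUNDARY = "frame"  # hypothetical multipart boundary name


def format_sse(payload: dict) -> str:
    """Serialize one telemetry sample as a Server-Sent Events message.

    Each message is a `data:` line terminated by a blank line.
    """
    return "data: " + json.dumps(payload) + "\n\n"


def mjpeg_part(jpeg_bytes: bytes, boundary: str = BOUNDARY) -> bytes:
    """Wrap one encoded JPEG frame as a part of a multipart/x-mixed-replace body."""
    header = (
        f"--{boundary}\r\n"
        "Content-Type: image/jpeg\r\n"
        f"Content-Length: {len(jpeg_bytes)}\r\n\r\n"
    )
    return header.encode("ascii") + jpeg_bytes + b"\r\n"
```

In a Flask app, generators yielding these values would typically be returned in a `Response` with the mimetype `text/event-stream` for telemetry and `multipart/x-mixed-replace; boundary=frame` for video. The browser replaces the displayed image each time a new part arrives.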
- Python 3.10+
- Box2D (SWIG required for Windows)
# 1. Clone the repository
git clone https://github.com/MathewJobey/S8-Honours-Reinforcement_Learning-Mini_Project.git
# 2. Create and activate a virtual environment
python -m venv venv
source venv/bin/activate # On Windows use: venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt

To fly the fully trained sac_phase3_final_v0 model interactively:
python app.py

Open http://127.0.0.1:5000 in your web browser.
To watch the AI perform automated test flights locally via Pygame:
python test_phase3.py

This project is licensed under the MIT License.
