Modern RL algorithms from scratch -- from Q-Learning to GRPO -- with clean PyTorch code and interactive notebooks.
Understand RL algorithms by building them. Compare PPO vs DPO vs GRPO for LLM alignment.
| Category | Algorithm | Key Idea | Environment | Notebook |
|---|---|---|---|---|
| Classic | Q-Learning | Tabular TD control | GridWorld | 01 |
| Classic | DQN | Neural Q-function + replay buffer | CartPole | 02 |
| Classic | REINFORCE | Monte Carlo policy gradient | CartPole | 03 |
| Classic | A2C | Actor-critic with advantage | CartPole | 03 |
| Classic | PPO | Clipped surrogate objective | CartPole | 03 |
| LLM Alignment | PPO (RLHF) | RM + PPO + KL penalty | Text | 04 |
| LLM Alignment | DPO | Direct preference optimization | Text | 04 |
| LLM Alignment | GRPO | Group-relative baselines (no critic) | Text | 04 |
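The "clipped surrogate objective" in the PPO rows can be sketched in a few lines of plain Python (scalar, single-action form; `ppo_clip_loss` is a hypothetical helper name, and `clip_eps=0.2` is the paper's usual default):

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate loss for a single action (scalar sketch).

    ratio = pi_new(a|s) / pi_old(a|s); clipping the ratio keeps one
    update from moving the policy too far (Schulman et al., 2017).
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # PPO maximizes the minimum of the two surrogates; as a loss, negate it.
    return -min(ratio * advantage, clipped * advantage)
```

When the new and old policies agree (`ratio = 1`), the loss reduces to `-advantage`; once the ratio leaves the `[1 - eps, 1 + eps]` band, the gradient through the clipped branch vanishes.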
The three dominant approaches to LLM alignment, compared:
PPO-RLHF (4 models: policy, reference, reward model, critic): Generate -> score with reward model -> PPO update with critic and KL penalty
DPO (2 models: policy, reference): DPO loss directly on (chosen, rejected) pairs -- no sampling, no reward model
GRPO (3 models: policy, reference, reward model): Generate G responses per prompt -> score -> group-normalize -> PPO-style update (no critic)
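The "group-normalize" step is what lets GRPO drop the critic: the advantage of each response is its z-score among the G responses to the same prompt. A minimal sketch (plain Python; `group_relative_advantages` is a hypothetical helper name):

```python
def group_relative_advantages(rewards):
    """GRPO baseline sketch: normalize rewards within a group of G
    responses to one prompt, replacing a learned value function."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    # Each response's advantage is its z-score within the group.
    return [(r - mean) / (std + 1e-8) for r in rewards]
```

Because the baseline is the group mean rather than a critic's estimate, no extra value network needs to be trained or kept in memory.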
| Property | PPO (RLHF) | DPO | GRPO |
|---|---|---|---|
| Needs Reward Model | Yes | No | Yes |
| Needs Critic | Yes | No | No |
| Needs Preferences | No | Yes | No |
| Memory Efficient | No | Yes | Yes |
| Online Learning | Yes | No | Yes |
| Complexity | High | Low | Medium |
When to use what:
- DPO: You have preference data and want simplicity
- PPO-RLHF: You need maximum flexibility with any reward signal
- GRPO: You want online RL but cannot afford a critic (large models)
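DPO's simplicity shows in its loss, which can be sketched for a single preference pair in plain Python (scalar sequence log-probabilities; `dpo_loss` is a hypothetical helper name, `beta=0.1` a common choice):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss sketch for one (chosen, rejected) pair (Rafailov et al., 2023).

    loss = -log sigmoid(beta * [(log pi_w - log ref_w) - (log pi_l - log ref_l)])
    No reward model, critic, or sampling loop is needed.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At initialization, when the policy equals the reference, the margin is zero and the loss is `log 2`; it falls as the policy puts more probability on chosen responses relative to rejected ones.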
```bash
# Clone and install
git clone https://github.com/your-username/rl-arena.git
cd rl-arena
pip install -e ".[llm]"

# Run notebooks
jupyter notebook notebooks/
```

```python
# Q-Learning on GridWorld
from rl_arena.environments.grid_world import GridWorld
from rl_arena.classic.q_learning import QLearningAgent

env = GridWorld()
agent = QLearningAgent(env.n_states, env.n_actions)
results = agent.train(env, n_episodes=1000)
```
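Under the hood, tabular Q-Learning is the off-policy TD control update from the table above. A minimal sketch of the rule a `QLearningAgent` applies each step (hypothetical internals; `Q` as a list-of-lists table indexed by state, then action):

```python
def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular Q-Learning update:
    Q[s][a] += alpha * (r + gamma * max_a' Q[s'][a'] - Q[s][a]),
    bootstrapping from the greedy next action (off-policy TD)."""
    target = r if done else r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q
```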
```python
# DQN on CartPole
import gymnasium as gym
from rl_arena.classic.dqn import DQNAgent

env = gym.make("CartPole-v1")
agent = DQNAgent(state_dim=4, action_dim=2)
results = agent.train(env, n_episodes=400)
```
```python
# Compare all LLM alignment methods
from rl_arena.llm_alignment.comparison import AlignmentComparison

comp = AlignmentComparison()
results = comp.run_all(n_steps=100)
comp.print_summary()
comp.plot()
```

```
rl-arena/
    rl_arena/
        classic/                # Classic RL algorithms
            q_learning.py       # Tabular Q-Learning
            dqn.py              # Deep Q-Network
            policy_gradient.py  # REINFORCE
            ppo.py              # Proximal Policy Optimization
            a2c.py              # Advantage Actor-Critic
        llm_alignment/          # LLM alignment algorithms
            rlhf_ppo.py         # PPO for RLHF (InstructGPT-style)
            dpo.py              # Direct Preference Optimization
            grpo.py             # Group Relative Policy Optimization
            comparison.py       # Side-by-side comparison utilities
        environments/           # Training environments
            grid_world.py       # Tabular grid world
            bandit.py           # Multi-armed bandit
            text_env.py         # Simplified text env for alignment demos
        utils/                  # Shared utilities
            plotting.py         # Publication-quality visualizations
            logging.py          # Training metrics logger
            common.py           # Replay buffers, GAE, seeding
    notebooks/
        01_q_learning.ipynb
        02_dqn.ipynb
        03_ppo.ipynb
        04_rlhf_vs_dpo_vs_grpo.ipynb  # <-- The key notebook
```
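`common.py` mentions GAE; the recursion it refers to is `A_t = delta_t + gamma * lam * A_{t+1}` with `delta_t = r_t + gamma * V_{t+1} - V_t`. A minimal sketch (assumed single-trajectory form, no terminal masking; `gae` is a hypothetical helper name):

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation (Schulman et al., 2016),
    computed backward over one trajectory; last_value bootstraps
    the value of the state after the final step."""
    advantages = [0.0] * len(rewards)
    next_value, next_adv = last_value, 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        next_adv = delta + gamma * lam * next_adv            # GAE recursion
        advantages[t] = next_adv
        next_value = values[t]
    return advantages
```

Setting `lam=0` recovers one-step TD advantages; `lam=1` recovers Monte Carlo returns minus the value baseline.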
- From scratch: Every algorithm is implemented from first principles in PyTorch. No RL libraries.
- Educational: Each file starts with a clear explanation of the algorithm and its key equations.
- Small models: LLM alignment demos use tiny transformers (2 layers, 64-dim) that run on CPU in seconds.
- Gymnasium API: Classic RL environments follow the Gymnasium interface for compatibility.
- Publication-quality plots: Consistent, attractive matplotlib visualizations throughout.
| Algorithm | Paper |
|---|---|
| Q-Learning | Watkins & Dayan, "Q-learning" (Machine Learning, 1992) |
| DQN | Mnih et al., "Human-level control through deep reinforcement learning" (Nature, 2015) |
| REINFORCE | Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning" (Machine Learning, 1992) |
| A2C/A3C | Mnih et al., "Asynchronous Methods for Deep Reinforcement Learning" (2016) |
| PPO | Schulman et al., "Proximal Policy Optimization Algorithms" (2017) |
| GAE | Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (2016) |
| RLHF | Ouyang et al., "Training language models to follow instructions with human feedback" (2022) |
| DPO | Rafailov et al., "Direct Preference Optimization" (NeurIPS 2023) |
| GRPO | Shao et al., "DeepSeekMath" (2024) |
License: MIT