
rl-arena

Modern RL algorithms from scratch -- from Q-Learning to GRPO -- with clean PyTorch code and interactive notebooks.

Understand RL algorithms by building them. Compare PPO vs DPO vs GRPO for LLM alignment.

Algorithm Map

Category       Algorithm   Key Idea                              Environment  Notebook
Classic        Q-Learning  Tabular TD control                    GridWorld    01
Classic        DQN         Neural Q-function + replay buffer     CartPole     02
Classic        REINFORCE   Monte Carlo policy gradient           CartPole     03
Classic        A2C         Actor-critic with advantage           CartPole     03
Classic        PPO         Clipped surrogate objective           CartPole     03
LLM Alignment  PPO (RLHF)  RM + PPO + KL penalty                 Text         04
LLM Alignment  DPO         Direct preference optimization        Text         04
LLM Alignment  GRPO        Group-relative baselines (no critic)  Text         04
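
The "clipped surrogate objective" entry is the heart of PPO and reappears in the alignment notebook, so it is worth seeing in isolation. A minimal sketch of the standard clipped loss, with illustrative names rather than the repo's API:

import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    # Probability ratio between the current policy and the one that collected the data.
    ratio = torch.exp(logp_new - logp_old)
    # Take the pessimistic minimum of the unclipped and clipped objectives.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()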

Key Insight: PPO vs DPO vs GRPO

The three dominant approaches to LLM alignment, compared:

PPO-RLHF (4 models):  Policy -> Generate -> Reward Model -> PPO + Critic
DPO      (2 models):  Policy -> DPO Loss on (chosen, rejected) pairs
GRPO     (3 models):  Policy -> Generate G responses -> Reward Model -> Group-normalize -> PPO (no critic)

Property            PPO (RLHF)  DPO  GRPO
Needs Reward Model  Yes         No   Yes
Needs Critic        Yes         No   No
Needs Preferences   No          Yes  No
Memory Efficient    No          Yes  Yes
Online Learning     Yes         No   Yes
Complexity          High        Low  Medium
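
The "Needs Reward Model: No" entry for DPO follows from its loss, which scores (chosen, rejected) pairs directly against a frozen reference policy. A minimal sketch of the standard DPO loss (Rafailov et al., 2023), with illustrative argument names:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: log-prob margins of the policy over the frozen reference.
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the probability that the chosen response beats the rejected one.
    return -F.logsigmoid(chosen - rejected).mean()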

When to use what:

  • DPO: You have preference data and want simplicity
  • PPO-RLHF: You need maximum flexibility with any reward signal
  • GRPO: You want online RL but cannot afford a critic (large models); see the sketch below
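
GRPO's critic-free trick is to baseline each response against the other responses sampled for the same prompt. A minimal sketch of the group-normalized advantage, with illustrative shapes:

import torch

def grpo_advantages(rewards, eps=1e-8):
    # rewards: (n_prompts, G) scalar rewards for G responses per prompt.
    # Normalizing within each group replaces a learned value baseline.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)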

Quick Start

# Clone and install
git clone https://github.com/your-username/rl-arena.git
cd rl-arena
pip install -e ".[llm]"

# Run notebooks
jupyter notebook notebooks/

Run individual algorithms

# Q-Learning on GridWorld
from rl_arena.environments.grid_world import GridWorld
from rl_arena.classic.q_learning import QLearningAgent

env = GridWorld()
agent = QLearningAgent(env.n_states, env.n_actions)
results = agent.train(env, n_episodes=1000)
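
Under the hood, QLearningAgent performs the standard tabular TD update each step; a sketch of that rule (the repo's exact variable names may differ):

import numpy as np

def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    # Bootstrap from the greedy value of the next state unless terminal.
    target = r + gamma * (0.0 if done else np.max(Q[s_next]))
    Q[s, a] += alpha * (target - Q[s, a])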

# DQN on CartPole
import gymnasium as gym
from rl_arena.classic.dqn import DQNAgent

env = gym.make("CartPole-v1")
agent = DQNAgent(state_dim=4, action_dim=2)
results = agent.train(env, n_episodes=400)
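
The core of a DQN update is a bootstrapped target computed from a frozen target network over a replay minibatch; a sketch of the standard rule (Mnih et al., 2015), with illustrative names:

import torch

def dqn_targets(target_net, rewards, next_states, dones, gamma=0.99):
    # No gradients flow through the frozen target network.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    # Zero out the bootstrap term on terminal transitions.
    return rewards + gamma * (1.0 - dones.float()) * next_q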

# Compare all LLM alignment methods
from rl_arena.llm_alignment.comparison import AlignmentComparison

comp = AlignmentComparison()
results = comp.run_all(n_steps=100)
comp.print_summary()
comp.plot()

Project Structure

rl-arena/
  rl_arena/
    classic/              # Classic RL algorithms
      q_learning.py       #   Tabular Q-Learning
      dqn.py              #   Deep Q-Network
      policy_gradient.py  #   REINFORCE
      ppo.py              #   Proximal Policy Optimization
      a2c.py              #   Advantage Actor-Critic
    llm_alignment/        # LLM alignment algorithms
      rlhf_ppo.py         #   PPO for RLHF (InstructGPT-style)
      dpo.py              #   Direct Preference Optimization
      grpo.py             #   Group Relative Policy Optimization
      comparison.py       #   Side-by-side comparison utilities
    environments/         # Training environments
      grid_world.py       #   Tabular grid world
      bandit.py           #   Multi-armed bandit
      text_env.py         #   Simplified text env for alignment demos
    utils/                # Shared utilities
      plotting.py         #   Publication-quality visualizations
      logging.py          #   Training metrics logger
      common.py           #   Replay buffers, GAE, seeding
  notebooks/
    01_q_learning.ipynb
    02_dqn.ipynb
    03_ppo.ipynb
    04_rlhf_vs_dpo_vs_grpo.ipynb    # <-- The key notebook
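
utils/common.py above lists GAE, which the actor-critic methods use to trade bias against variance in advantage estimates. A minimal sketch of the standard recursion (Schulman et al., 2016), not necessarily the repo's exact code:

import torch

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # values carries one extra bootstrap entry: len(values) == len(rewards) + 1.
    advantages = torch.zeros_like(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        nonterminal = 1.0 - dones[t].float()
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        advantages[t] = last
    return advantages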

Design Decisions

  • From scratch: Every algorithm is implemented from first principles in PyTorch. No RL libraries.
  • Educational: Each file starts with a clear explanation of the algorithm and its key equations.
  • Small models: LLM alignment demos use tiny transformers (2 layers, 64-dim) that run on CPU in seconds; see the scale sketch after this list.
  • Gymnasium API: Classic RL environments follow the Gymnasium interface for compatibility.
  • Publication-quality plots: Consistent, attractive matplotlib visualizations throughout.
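
For a sense of scale, the "small models" decision means something like the following hypothetical configuration at the stated size, not the repo's actual model class:

import torch.nn as nn

# A 2-layer, 64-dim transformer encoder: small enough to train on CPU in seconds.
tiny_lm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=256,
                               batch_first=True),
    num_layers=2,
)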

References

Algorithm   Paper
Q-Learning  Watkins & Dayan, "Q-Learning" (Machine Learning, 1992)
DQN         Mnih et al., "Human-Level Control Through Deep Reinforcement Learning" (Nature, 2015)
REINFORCE   Williams, "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning" (Machine Learning, 1992)
A2C/A3C     Mnih et al., "Asynchronous Methods for Deep Reinforcement Learning" (ICML 2016)
PPO         Schulman et al., "Proximal Policy Optimization Algorithms" (2017)
GAE         Schulman et al., "High-Dimensional Continuous Control Using Generalized Advantage Estimation" (ICLR 2016)
RLHF        Ouyang et al., "Training Language Models to Follow Instructions with Human Feedback" (NeurIPS 2022)
DPO         Rafailov et al., "Direct Preference Optimization: Your Language Model Is Secretly a Reward Model" (NeurIPS 2023)
GRPO        Shao et al., "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models" (2024)

License

MIT
