A minimal, readable implementation of Reinforcement Learning from Human Feedback: reward modeling, preference learning, and PPO-style policy optimization from scratch
Modern LLMs like ChatGPT and Claude are aligned to human preferences through RLHF, a multi-stage training process involving supervised fine-tuning, reward modeling from human comparisons, and reinforcement learning to optimize the policy toward those rewards.
This project implements the full RLHF pipeline at minimal scale using pure PyTorch, with no external alignment libraries. Every component is readable and hackable in under 200 lines total.
Why this exists: Most RLHF tutorials skip the implementation details or rely on opaque wrappers. This repo shows exactly what each stage does mathematically and in code.
```
Raw Text Prompts
        │
        ▼
Tokenizer – whitespace tokenization + vocab encoding
        │
        ▼
MiniLM (Policy) – Embedding → LSTM → Linear (language model)
        │
        ▼
RewardModel – Embedding → LSTM → Scalar score
        │
        ▼
Preference Loss – L = -log σ(r_chosen - r_rejected)
        │
        ▼
PPO-style Update – maximize expected reward via policy gradient
```
| Module | Description |
|---|---|
| `mini_lm.py` | MiniLM policy model: Embedding → LSTM → Linear |
| `reward_model.py` | RewardModel: scores responses with a scalar value |
| `preference_loss.py` | Bradley-Terry preference loss for pairwise ranking |
| `ppo_step.py` | Simplified PPO policy gradient update |
| `tokenizer.py` | Whitespace tokenizer with vocabulary builder |
| `toy_dataset.py` | Synthetic preference pairs for training and evaluation |
| `train_reward_model.py` | Training script: evaluates preference ranking |
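As a reference point, the tokenizer interface used in the usage example below (`build_vocab`, `encode`, `vocab_size`) could look like this minimal sketch; the actual code in `tokenizer.py` may differ in details such as unknown-token handling:

```python
class Tokenizer:
    """Whitespace tokenizer with a vocabulary builder (sketch, not the repo's exact code)."""

    def __init__(self):
        # Reserve index 0 for unknown tokens (an assumption of this sketch)
        self.vocab = {"<unk>": 0}

    def build_vocab(self, texts):
        """Assign an integer id to every whitespace-separated word."""
        for text in texts:
            for word in text.lower().split():
                if word not in self.vocab:
                    self.vocab[word] = len(self.vocab)

    @property
    def vocab_size(self):
        return len(self.vocab)

    def encode(self, text):
        """Map a string to a list of token ids, falling back to <unk>."""
        return [self.vocab.get(w, 0) for w in text.lower().split()]
```

Lower-casing before splitting keeps "Phishing" and "phishing" as one vocabulary entry, which matters at this tiny scale.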
A lightweight causal language model used to generate token probability distributions.
```
Input tokens
      │
      ▼
Embedding(vocab_size, 128)
      │
      ▼
LSTM(128 → 256, batch_first=True)
      │
      ▼
Linear(256 → vocab_size) → logits over vocabulary
```
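This stack maps directly onto a few lines of PyTorch. A minimal sketch of what `mini_lm.py` might contain, assuming the `MiniLM(vocab=..., embed=...)` constructor shown in the usage example below:

```python
import torch
import torch.nn as nn

class MiniLM(nn.Module):
    """Toy causal LM: Embedding -> LSTM -> Linear over the vocabulary."""

    def __init__(self, vocab: int, embed: int = 128, hidden: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) -> logits: (batch, seq_len, vocab)
        x = self.embedding(token_ids)
        out, _ = self.lstm(x)  # out: (batch, seq_len, hidden)
        return self.head(out)
```

Returning per-position logits makes the same module usable both for next-token prediction and for computing log-probabilities of a given sequence during the policy update.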
Scores a response with a single scalar; a higher score means the response is more preferred.
```
Input tokens
      │
      ▼
Embedding(vocab_size, 128)
      │
      ▼
LSTM(128 → 256, batch_first=True)
      │
      ▼
Linear(256 → 1) → scalar reward
```
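A sketch of how `reward_model.py` might implement this, assuming the final LSTM hidden state is used as the sequence summary (one reasonable choice; the repo could instead pool over all positions):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a token sequence with one scalar: Embedding -> LSTM -> Linear(1)."""

    def __init__(self, vocab: int, embed: int = 128, hidden: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(token_ids)            # (batch, seq, embed)
        _, (h_n, _) = self.lstm(x)               # h_n: (num_layers, batch, hidden)
        return self.score(h_n[-1]).squeeze(-1)   # (batch,) one scalar per sequence
```

The scalar output is untrained and uncalibrated at initialization; it only becomes meaningful relative to other responses after training on preference pairs.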
Given a chosen response c and rejected response r, the reward model is trained to
maximize the margin between their scores:
```
L = -log σ(r_chosen - r_rejected)
```

where σ is the sigmoid function. This is the same loss used in InstructGPT and Claude's RLHF.
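In code this is essentially one line. A sketch of what `preference_loss.py` might contain, using `logsigmoid` rather than `log(sigmoid(x))` for numerical stability:

```python
import torch
import torch.nn.functional as F

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Minimizing this pushes the chosen score above the rejected score;
    logsigmoid avoids underflow for large negative margins.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

At a margin of zero the loss is log 2 ≈ 0.693, which explains why the untrained reward model's average loss in the placeholder results hovers around 0.66.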
The policy is updated to maximize the expected reward signal from the reward model:
```
loss = -E[reward_model(actions)]
```

This is a simplified REINFORCE-style update; a full PPO implementation would add a clipped surrogate objective and a KL penalty against the reference policy.
```
git clone https://github.com/Iamyulx/rlhf-mini-implementation.git
cd rlhf-mini-implementation
pip install -r requirements.txt
```

Train and evaluate the reward model:

```
python train_reward_model.py
```

Use components programmatically:
```python
from tokenizer import Tokenizer
from mini_lm import MiniLM
from reward_model import RewardModel
from preference_loss import preference_loss
from ppo_step import ppo_step
import torch

# Build vocabulary
tokenizer = Tokenizer()
tokenizer.build_vocab(["explain phishing", "what is malware"])
vocab_size = tokenizer.vocab_size

# Initialize models
policy = MiniLM(vocab=vocab_size, embed=128)
reward = RewardModel(vocab=vocab_size)

# Score a preference pair
chosen_ids = torch.tensor([tokenizer.encode("phishing is a cyber attack")])
rejected_ids = torch.tensor([tokenizer.encode("phishing is a type of fish")])
r_chosen = reward(chosen_ids)
r_rejected = reward(rejected_ids)
loss = preference_loss(r_chosen, r_rejected)
print(f"Preference loss: {loss.item():.4f}")

# PPO step
ppo_loss = ppo_step(policy, reward, chosen_ids)
print(f"PPO loss: {ppo_loss.item():.4f}")
```
⚠️ Placeholder values. Run `train_reward_model.py` and update with real output.
| Metric | Value |
|---|---|
| Reward model avg loss | ~0.661 |
| Chosen response avg score | ~0.050 |
| Rejected response avg score | ~-0.023 |
| Margin (chosen − rejected) | ~0.073 |
| Vocabulary size | 33 tokens |
| PPO avg reward | ~0.041 |
Sample output from `train_reward_model.py`:

```
Prompt: Explain phishing
Chosen (score: 0.0905): Phishing is a cyber attack
Rejected (score: 0.0032): Phishing is a type of fish
Loss: 0.6505

Prompt: What is malware
Chosen (score: -0.0093): Malware is malicious software
Rejected (score: -0.0495): Malware is a computer mouse
Loss: 0.6732

Average RewardModel loss: 0.6618
```
The project includes a minimal cybersecurity preference dataset, a nod to the domain expertise behind the project:
```python
preferences = [
    {
        "prompt": "Explain phishing",
        "chosen": "Phishing is a cyber attack that tricks users into revealing credentials",
        "rejected": "Phishing is a type of fish"
    },
    {
        "prompt": "What is malware",
        "chosen": "Malware is malicious software designed to damage or gain unauthorized access",
        "rejected": "Malware is a computer mouse"
    },
]
```

This is intentionally a toy implementation. Key simplifications vs. production RLHF:
| This repo | Production RLHF (InstructGPT / Claude) |
|---|---|
| LSTM policy | Transformer (GPT-2, LLaMA, etc.) |
| Simplified policy gradient | Full PPO with clipped surrogate + KL penalty |
| 33-token vocabulary | 50k+ BPE tokens |
| 2 preference pairs | 10k–1M human comparisons |
| No reference model | KL divergence from frozen SFT model |
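The "KL divergence from frozen SFT model" row can be made concrete: production RLHF shapes the reward so the policy is penalized for drifting from the reference model. A minimal sketch, where the per-token log-probabilities and the coefficient `beta` are assumptions of this illustration:

```python
import torch

def kl_penalized_reward(r_score: torch.Tensor,
                        policy_logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """InstructGPT-style shaped reward: R = r_RM - beta * KL(policy || reference).

    policy_logprobs / ref_logprobs: (batch, seq) log-probs of the sampled tokens
    under the current policy and the frozen SFT model, respectively.
    """
    # Per-sequence KL estimate from the sampled tokens
    kl = (policy_logprobs - ref_logprobs).sum(dim=-1)
    return r_score - beta * kl
```

The penalty grows as the policy concentrates probability on tokens the reference model considered unlikely, which is what keeps a tuned policy from reward-hacking its way into degenerate text.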
- Replace LSTM with a small Transformer policy
- Implement full PPO with clipping and KL penalty
- Add BPE tokenizer
- Scale preference dataset
- Add W&B logging
- HuggingFace `trl` integration comparison
- Ouyang et al. (2022), *Training language models to follow instructions with human feedback* (InstructGPT)
- Christiano et al. (2017), *Deep Reinforcement Learning from Human Preferences*
- Schulman et al. (2017), *Proximal Policy Optimization Algorithms*
- Bai et al. (2022), *Constitutional AI: Harmlessness from AI Feedback*
MIT © Iamyulx