A fully transparent implementation of the GPT-style transformer — both inference and training — using only scalar operations and explicit nested loops. No matrix libraries, no SIMD, no BLAS. Every multiply and addition is visible in the source code.
The goal is to make the data flow through a transformer completely traceable, as if you were stepping through it on 1980s hardware. If you want to understand exactly what happens inside a transformer at the lowest level, this is the code to read.
Both the forward pass and the backward pass (backpropagation) have been verified against PyTorch, matching to approximately 1e-6 precision.
| File | Purpose |
|---|---|
| `transformer_test.cpp` | Inference-only forward pass |
| `transformer_train.cpp` | Forward pass + backward pass + SGD training |
| `train_and_export.py` | Trains a tiny PyTorch model, exports weights, saves reference logits for inference verification |
| `train_verify.py` | Exports initial weights, trains with PyTorch SGD, saves losses for training verification |
- A C++ compiler (g++ or clang++)
- Python 3 with PyTorch and NumPy
- No other dependencies
This trains a tiny model in PyTorch, exports the weights, runs the C++ forward pass on the same input, and checks that the outputs match.
```
python train_and_export.py
g++ -O2 -o transformer_test transformer_test.cpp -lm
./transformer_test
```

Expected output:

```
PASS: outputs match within floating-point tolerance.
```
This exports initial weights from PyTorch, trains 50 SGD steps in both Python and C++, and confirms the per-step losses match.
```
python train_verify.py
g++ -O2 -o transformer_train transformer_train.cpp -lm
./transformer_train
```

Expected output:

```
PASS: training losses match closely.
```
The test model is deliberately tiny (so verification runs in under a second), but it uses the same architecture as real GPT-style transformers:
- Embedding: token lookup table, no positional encoding (not needed for this test)
- 4 transformer layers, each containing:
- Layer norm (pre-norm, as in GPT-2)
- Multi-head causal self-attention (4 heads, separate Q/K/V matrices)
- Residual connection
- Layer norm
- Feed-forward MLP (expand 4x, ReLU, project back)
- Residual connection
- Final layer norm
- Unembedding: linear projection to vocabulary logits
Dimensions: D_MODEL=32, N_HEADS=4, D_HEAD=8, D_MLP=128, VOCAB=256, SEQ_LEN=16.
The forward pass in transformer_test.cpp (lines 136-243) is the clearest
starting point. It is a two-dimensional scan:
- Outer loop over layers (bottom to top)
- Inner loop over sequence positions (left to right)
At each layer, two vectors are added to the residual stream: the attention
output, then the MLP output. Causality is enforced by loop bounds (j <= i)
rather than masking tricks.
The backward pass in transformer_train.cpp reverses each operation, applying
the chain rule step by step: loss gradient, unembed, final layer norm, then
each layer top-to-bottom (MLP backward, attention backward, layer norm backward).
Residual connections let the gradient flow straight through while each sublayer
adds its own contribution.
- Scalar operations only: every dot product is an explicit inner loop. This makes it possible to set a breakpoint on any single multiply-accumulate and see exactly which weight is being applied to which activation.
- No bias in linear layers: keeps the weight count down and matches common practice in modern transformers.
- ReLU instead of GELU: simpler to verify. GELU can be swapped in by changing one line.
- SGD instead of Adam: the simplest possible optimizer, so the weight update is just `param -= lr * grad`. Adam can be added later.
This code is released into the public domain. Use it however you like.