A fully transparent implementation of the GPT-style transformer — both inference and training — using only scalar operations and explicit nested loops. No matrix libraries, no SIMD, no BLAS. Every multiply and addition is visible in the source code.
The goal is to make the data flow through a transformer completely traceable, as if you were stepping through it on 1980s hardware. If you want to understand exactly what happens inside a transformer at the lowest level, this is the code to read.
Both the forward pass and the backward pass (backpropagation) have been verified against PyTorch, matching to approximately 1e-6 precision.
| File | Purpose |
|---|---|
| `transformer_test.cpp` | Inference-only forward pass |
| `transformer_train.cpp` | Forward pass + backward pass + SGD training |
| `train_and_export.py` | Trains a tiny PyTorch model, exports weights, saves reference logits for inference verification |
| `train_verify.py` | Exports initial weights, trains with PyTorch SGD, saves losses for training verification |
- A C++ compiler (g++ or clang++)
- Python 3 with PyTorch and NumPy
- No other dependencies
This trains a tiny model in PyTorch, exports the weights, runs the C++ forward pass on the same input, and checks that the outputs match.
```
python train_and_export.py
g++ -O2 -o transformer_test transformer_test.cpp -lm
./transformer_test
```

Expected output:

```
PASS: outputs match within floating-point tolerance.
```
This exports initial weights from PyTorch, trains 50 SGD steps in both Python and C++, and confirms the per-step losses match.
```
python train_verify.py
g++ -O2 -o transformer_train transformer_train.cpp -lm
./transformer_train
```

Expected output:

```
PASS: training losses match closely.
```
The test model is deliberately tiny (so verification runs in under a second), but it uses the same architecture as real GPT-style transformers:
- Embedding: token lookup table, no positional encoding (not needed for this test)
- 4 transformer layers, each containing:
- Layer norm (pre-norm, as in GPT-2)
- Multi-head causal self-attention (4 heads, separate Q/K/V matrices)
- Residual connection
- Layer norm
- Feed-forward MLP (expand 4x, ReLU, project back)
- Residual connection
- Final layer norm
- Unembedding: linear projection to vocabulary logits
Dimensions: D_MODEL=32, N_HEADS=4, D_HEAD=8, D_MLP=128, VOCAB=256, SEQ_LEN=16.
The forward pass in transformer_test.cpp (lines 136-243) is the clearest
starting point. It is a two-dimensional scan:
- Outer loop over layers (bottom to top)
- Inner loop over sequence positions (left to right)
At each layer, two vectors are added to the residual stream: the attention
output, then the MLP output. Causality is enforced by loop bounds (j <= i)
rather than masking tricks.
The backward pass in transformer_train.cpp reverses each operation, applying
the chain rule step by step: loss gradient, unembed, final layer norm, then
each layer top-to-bottom (MLP backward, attention backward, layer norm backward).
Residual connections let the gradient flow straight through while each sublayer
adds its own contribution.
- Scalar operations only: every dot product is an explicit inner loop. This makes it possible to set a breakpoint on any single multiply-accumulate and see exactly which weight is being applied to which activation.
- No bias in linear layers: keeps the weight count down and matches common practice in modern transformers.
- ReLU instead of GELU: simpler to verify. GELU can be swapped in by changing one line.
- SGD instead of Adam: the simplest possible optimizer, so the weight update is just `param -= lr * grad`. Adam can be added later.
This code is released into the public domain. Use it however you like.