A minimal implementation of a scalar-based autograd engine and the foundational structures for a transformer-based language model, written entirely in modern C++. This project is inspired by Andrej Karpathy's micrograd and nanoGPT/minGPT work.
- Autograd Engine: A `Value` class that builds a computational graph for scalar values and supports reverse-mode automatic differentiation (backpropagation); see the sketch after this list.
- Supported Operations: Addition (`add`), multiplication (`mul`), power (`pow`), logarithm (`log`), exponential (`exp`), and ReLU activation (`relu`).
- Data Loading: Parses a text file (`names.txt`), creates a character-level vocabulary, and handles token-to-ID mapping.
- Transformer Initialization: Initializes parameter matrices for a basic character-level transformer architecture, including:
  - Token and position embeddings (`wte`, `wpe`)
  - Language model head (`lm_head`)
  - Multi-head attention weights (`attn_wq`, `attn_wk`, `attn_wv`, `attn_wo`)
  - MLP layers (`mlp_fc1`, `mlp_fc2`)
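Below is a minimal, self-contained sketch of the kind of reverse-mode engine described above. The free functions `val`, `add`, `mul`, `relu`, and `backward` mirror the description but are assumptions; the actual `Value` class in `microgpt.cpp` may be organized differently:

```cpp
#include <functional>
#include <iostream>
#include <memory>
#include <unordered_set>
#include <vector>

// A micrograd-style scalar node: value, accumulated gradient, and the
// operands plus local derivatives needed to walk the DAG backwards.
struct Value {
    double data;
    double grad = 0.0;
    std::vector<std::shared_ptr<Value>> children;  // operands
    std::vector<double> local_grads;               // d(this)/d(child)
    explicit Value(double d) : data(d) {}
};

using ValuePtr = std::shared_ptr<Value>;

ValuePtr val(double d) { return std::make_shared<Value>(d); }

ValuePtr add(const ValuePtr& a, const ValuePtr& b) {
    auto out = val(a->data + b->data);
    out->children = {a, b};
    out->local_grads = {1.0, 1.0};
    return out;
}

ValuePtr mul(const ValuePtr& a, const ValuePtr& b) {
    auto out = val(a->data * b->data);
    out->children = {a, b};
    out->local_grads = {b->data, a->data};
    return out;
}

ValuePtr relu(const ValuePtr& a) {
    auto out = val(a->data > 0.0 ? a->data : 0.0);
    out->children = {a};
    out->local_grads = {a->data > 0.0 ? 1.0 : 0.0};
    return out;
}

// Reverse-mode backprop: topologically sort the DAG from the output node,
// then apply the chain rule in reverse order.
void backward(const ValuePtr& root) {
    std::vector<ValuePtr> topo;
    std::unordered_set<Value*> visited;
    std::function<void(const ValuePtr&)> build = [&](const ValuePtr& v) {
        if (!visited.insert(v.get()).second) return;
        for (auto& c : v->children) build(c);
        topo.push_back(v);
    };
    build(root);
    root->grad = 1.0;
    for (auto it = topo.rbegin(); it != topo.rend(); ++it)
        for (std::size_t i = 0; i < (*it)->children.size(); ++i)
            (*it)->children[i]->grad += (*it)->grad * (*it)->local_grads[i];
}

int main() {
    auto x = val(3.0);
    auto c = add(mul(x, x), mul(x, x));  // c = 2x^2
    backward(c);
    std::cout << x->grad << "\n";        // prints 12, i.e. 4x at x = 3
}
```

Compiled with `g++ -std=c++17`, this prints `12`: the chain rule sums the contributions of all four paths from `c` back to `x` (each `mul` contributes `3.0` twice).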
- A modern C++ compiler with C++17 support (the code uses structured bindings, `<unordered_map>`, etc.).
- A dataset file named `names.txt` (a text file where each line contains a name/word to load) placed in the build directory.
- Provide the dataset: Create a `names.txt` file in the same directory as `microgpt.cpp`, with one name per line. For example:

  ```
  emma
  olivia
  ava
  isabella
  ```

- Compile the code: Using `g++` or `clang++`:

  ```sh
  g++ -std=c++17 microgpt.cpp -o microgpt
  ```

- Run the executable:

  ```sh
  ./microgpt
  ```
When you run the project, you should see the vocabulary generation and matrix initialization details:

```
num docs: 4 ...
char_set: abe...
length: ...
BOS token id: ...
vocab size: ...
Mapping:
a -> 0
b -> 1
...
54
num params: ...
wte
wpe
lm_head
layer0.attn_wq
...
```

The lone `54` is the output of the gradient test case in `main()`, a sanity check that the reverse-mode autograd applies the chain rule correctly through the DAG.
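For reference, here is a minimal sketch of how such a character-level vocabulary and token-to-ID mapping can be produced. This is an assumed reconstruction, not the exact code from `microgpt.cpp`; the `stoi` map and the BOS-token-gets-the-last-id convention are illustrative choices:

```cpp
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
    // Read one document (name) per line from names.txt.
    std::ifstream in("names.txt");
    std::vector<std::string> docs;
    std::string line;
    while (std::getline(in, line))
        if (!line.empty()) docs.push_back(line);

    // Collect the unique characters; std::set keeps them sorted.
    std::set<char> char_set;
    for (const auto& d : docs)
        for (char ch : d) char_set.insert(ch);

    // Assign consecutive ids, reserving one extra id for the BOS token.
    std::unordered_map<char, int> stoi;
    int id = 0;
    for (char ch : char_set) stoi[ch] = id++;
    const int bos = id;

    std::cout << "num docs: " << docs.size() << "\n";
    std::cout << "char_set: ";
    for (char ch : char_set) std::cout << ch;
    std::cout << "\nBOS token id: " << bos << "\n";
    std::cout << "vocab size: " << bos + 1 << "\n";
    std::cout << "Mapping:\n";
    for (char ch : char_set) std::cout << ch << " -> " << stoi[ch] << "\n";
}
```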
- `Value`: The core component for automatic differentiation. It stores the `data`, the accumulated `grad`, the local gradients relative to its operands (`local_grads`), and pointers to the operands (`children`) used to build the directed acyclic graph (DAG). Calling `backward()` on the final output node performs a topological sort and applies the chain rule in reverse order to populate the gradients.
- `Matrix`: A 2D `std::vector` of `std::shared_ptr<Value>`.
- `matrix()`: Helper function that initializes weight matrices from a normal distribution (`std::normal_distribution`, via `randn`); see the sketch below.
- `main()`: Serves as a playground that demonstrates how to read the text data, construct the vocabulary mapping, test the reverse-mode autograd, and finally allocate the parameter matrices required by the GPT forward pass.
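A rough sketch of the `Matrix` type and a `matrix()`-style initializer follows. The exact signature, seed, and default standard deviation are assumptions; only the use of `std::normal_distribution` is taken from the description above:

```cpp
#include <memory>
#include <random>
#include <vector>

struct Value {  // trimmed-down node; see the autograd sketch above
    double data;
    double grad = 0.0;
    explicit Value(double d) : data(d) {}
};

// Matrix: a 2D grid of shared_ptr<Value>, as described above.
using Matrix = std::vector<std::vector<std::shared_ptr<Value>>>;

// matrix()-style helper: fill rows x cols with samples from N(0, stddev^2)
// so that every weight is a node in the autograd graph. Seed and default
// stddev are illustrative assumptions.
Matrix matrix(std::size_t rows, std::size_t cols, double stddev = 0.02) {
    static std::mt19937 rng(42);  // fixed seed for reproducible init
    std::normal_distribution<double> randn(0.0, stddev);
    Matrix m(rows, std::vector<std::shared_ptr<Value>>(cols));
    for (auto& row : m)
        for (auto& cell : row)
            cell = std::make_shared<Value>(randn(rng));
    return m;
}
```

A call such as `matrix(vocab_size, n_embd)` could then allocate the `wte` table (`n_embd` being a hypothetical name for the embedding width), with every entry participating in backpropagation.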