A compact GPT-style language model built from scratch, trained on Shakespeare. Learns character-level patterns, style, and rhythm from Shakespeare’s works. Fully implemented in PyTorch with a decoder-only Transformer architecture.
GPT-Mini is a decoder-only Transformer implemented in PyTorch, trained on the complete works of Shakespeare (~1.1M characters).
It learns character-level language modeling, capturing voice, structure, and rhythm from Shakespeare’s plays and poetry.
Prompt → Tokenize → Embed → [Decoder ×6] → Linear → Softmax → Next Character
**Tokenizer**
- Character-level: each character → unique token
- No subword or BPE tokenization
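A character-level tokenizer fits in a few lines. This is an illustrative sketch; the actual mapping lives in `src/data/tokenizer.py`:

```python
# Build a character-level vocabulary from the corpus itself; ids are assigned
# by sorted order, so the exact numbers depend on the text.
text = open("data/tinyshakespeare.txt", encoding="utf-8").read()
chars = sorted(set(text))                       # 65 unique characters
stoi = {ch: i for i, ch in enumerate(chars)}    # char -> token id
itos = {i: ch for ch, i in stoi.items()}        # token id -> char

encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: "".join(itos[i] for i in ids)

assert decode(encode("To be or not to")) == "To be or not to"
```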
**Embeddings**
- Token embedding + learned positional embeddings (GPT-style)
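A minimal sketch of how the two embeddings combine, assuming the dimensions from the configuration table (embedding dim 128, context 256, vocab 65) and the example ids used in the diagram below:

```python
import torch
import torch.nn as nn

tok_emb = nn.Embedding(65, 128)    # one learned vector per character
pos_emb = nn.Embedding(256, 128)   # one learned vector per position (GPT-style)

idx = torch.tensor([[56, 4, 32]])                        # ids for a 3-char prompt
x = tok_emb(idx) + pos_emb(torch.arange(idx.size(1)))    # shape: [1, 3, 128]
```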
**Decoder Block (×6)**
- Pre-LayerNorm → Causal Self-Attention → Residual
- Pre-LayerNorm → Feedforward (4× width, GELU) → Residual
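A minimal sketch of one block. It uses `torch.nn.MultiheadAttention` for brevity; the repo's `src/model/attention.py` implements attention itself, so names and details will differ:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Pre-LayerNorm decoder block (n_embd=128, n_head=4, 4x MLP width, GELU)."""
    def __init__(self, n_embd=128, n_head=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # 128 -> 512
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),   # 512 -> 128
        )

    def forward(self, x):                    # x: [batch, seq, n_embd]
        T = x.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=causal)   # masked self-attention
        x = x + a                                     # residual
        x = x + self.mlp(self.ln2(x))                 # residual
        return x
```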
**Output**
- Linear projection tied to token embeddings
- Softmax for next-character probabilities
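Weight tying keeps the model small: the output projection reuses the 65×128 embedding matrix. A minimal sketch:

```python
import torch.nn as nn

tok_emb = nn.Embedding(65, 128)            # the same table used at the input
lm_head = nn.Linear(128, 65, bias=False)   # projects hidden states to 65 logits
lm_head.weight = tok_emb.weight            # weight tying: one shared matrix
```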
**Generation**
- Autoregressive, supports temperature and top-k sampling
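One sampling step with temperature and top-k might look like this sketch, assuming `logits` is the logit vector for the final position:

```python
import torch
import torch.nn.functional as F

def sample_next(logits, temperature=1.0, top_k=None):
    """Pick the next character id from a [vocab_size] logit vector."""
    logits = logits / temperature                   # <1 sharpens, >1 flattens
    if top_k is not None:
        v, _ = torch.topk(logits, top_k)
        logits = logits.masked_fill(logits < v[-1], float("-inf"))  # keep top-k only
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # [1] tensor with the sampled id
```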
```mermaid
graph TD
A["Input: 'The'"] --> B["Tokenizer:<br/>'T'→56, 'h'→4, 'e'→32"]
B --> C["Token Embeddings<br/>[56, 4, 32] → [[vec1], [vec2], [vec3]]<br/>Shape: [3, 128]"]
C --> D["Positional Embeddings<br/>+pos_enc[0], +pos_enc[1], +pos_enc[2]<br/>Shape: [3, 128]"]
D --> E["Decoder Block 1"]
E --> F["..."]
F --> G["Decoder Block 6"]
G --> H["Final LayerNorm<br/>Shape: [3, 128]"]
H --> I["LM Head (Linear)<br/>[3, 128] → [3, 65]"]
I --> J["Softmax → Probabilities<br/>for all 65 chars"]
J --> K["Prediction:<br/>Next char after 'e'<br/>(e.g., ' ' or ',')"]
subgraph "Decoder Block (Single Layer)"
L1["Pre-LayerNorm"]
L1 --> L2["Causal Self-Attention<br/>4 Heads, Masked"]
L2 --> L3["Residual Add"]
L3 --> L4["Pre-LayerNorm"]
L4 --> L5["MLP (128→512→128)<br/>GELU Activation"]
L5 --> L6["Residual Add"]
end
E -.-> L1
L6 -.-> F
```
Autoregressive Generation Loop:
- The model predicts the most likely next character (e.g., a space ' ').
- That character is appended to the input sequence ("The ").
- The process repeats, using the updated sequence as input to predict and append the next character.
- Generation continues until the maximum sequence length (e.g., 256 characters) is reached or a stopping condition is met.
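A minimal sketch of this loop, assuming a `model(idx)` call that returns raw logits of shape `[batch, seq, vocab]` and the `sample_next` helper sketched earlier; the repo's actual interface may differ:

```python
import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size=256):
    # idx: [1, T] tensor of token ids for the prompt, e.g. encode("The")
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]              # never exceed the context length
        logits = model(idx_cond)                     # [1, T, vocab_size]
        next_id = sample_next(logits[0, -1])         # sample from the last position
        idx = torch.cat([idx, next_id.view(1, 1)], dim=1)   # append, then repeat
    return idx
```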
| Component | Details |
|---|---|
| Architecture | Decoder-only Transformer |
| Layers | 6 |
| Embedding Dim | 128 |
| Attention Heads | 4 |
| Context Length | 256 |
| Vocabulary Size | 65 (character-level) |
| Parameters | 1.23M |
| Positional Encoding | Learned embeddings |
| LayerNorm | Pre-attention & pre-MLP |
| Training Steps | 10,000 (on GPU) |
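
The same settings expressed as a Python config object, for reference; field names here are illustrative, the real values live in `configs/gpt1_char.yaml` and `src/train/config.py`:

```python
from dataclasses import dataclass

@dataclass
class GPTMiniConfig:           # illustrative field names, not the repo's own
    n_layer: int = 6
    n_head: int = 4
    n_embd: int = 128
    block_size: int = 256      # context length (characters)
    vocab_size: int = 65
```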
| Metric | Value |
|---|---|
| Perplexity | 3.02 |
| Character Accuracy | 64.9% |
| NLL (nats/char) | 1.105 |
| BPC (bits/char) | 1.594 |
Evaluated on held-out Shakespeare text. Metrics are stored in `evaluation_metrics.json`.
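The reported values are mutually consistent: perplexity is `exp(NLL)` and BPC is `NLL / ln 2`:

```python
import math

nll = 1.105                  # average negative log-likelihood per character (nats)
print(math.exp(nll))         # ≈ 3.02  -> perplexity
print(nll / math.log(2))     # ≈ 1.594 -> bits per character
```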
```
gpt-mini/
├── src/model/ # attention.py, transformer.py, embeddings.py
├── src/data/ # tokenizer.py, dataloader.py
├── src/train/ # trainer.py, config.py
├── src/utils/ # debug.py, export.py
├── configs/gpt1_char.yaml
├── deploy/app.py # Gradio: Generate + Evaluate
├── tests/ # Unit tests
├── data/tinyshakespeare.txt
├── train.py
└── evaluation_metrics.json
```
Launch the Gradio demo with `python deploy/app.py`.
- Type a prompt (e.g., "To be or not to")
- GPT-Mini generates text character-by-character in Shakespearean style
- Transformer architecture: Vaswani et al., Attention Is All You Need (2017)
- Positional embeddings in GPT: Learned embeddings, GPT-2 style
- Educational guidance: Karpathy, “Let’s build GPT from scratch”
All code is original and reflects the design and behavior described above.
Small. Transparent. Understandable. GPT-Mini captures how transformers generate language, with a focus on clarity.