A faithful “from-scratch” implementation of the Transformer architecture:
- Token/positional embeddings
- Multi-Head Self-Attention
- Encoder/Decoder stacks with residual connections + LayerNorm
- Position-wise feed-forward networks
- Teacher forcing / masking for decoder inputs (if seq2seq)
- Clear training pipeline with early stopping & LR scheduling
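The embedding step listed above (token embeddings plus sinusoidal positional encoding) can be sketched as follows; this is a minimal sketch, and the function name is illustrative, not from the repo:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Sin/cos positional encoding as in 'Attention Is All You Need'.

    Even feature indices get sin, odd indices get cos, with wavelengths
    forming a geometric progression from 2*pi to 10000*2*pi.
    """
    position = torch.arange(max_len).unsqueeze(1)                 # (max_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                  # even dims
    pe[:, 1::2] = torch.cos(position * div_term)                  # odd dims
    return pe                                                     # (max_len, d_model)
```

The encoding is typically added to the token embeddings (scaled by `sqrt(d_model)`) before the first encoder layer.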
- Embedding(d_model) + sinusoidal positional encoding
- N encoder layers:
- Multi-Head Self-Attention (Q/K/V projections, scaled dot-product, softmax)
- Residual + LayerNorm
- FFN: Linear → ReLU → Linear
- N decoder layers (if seq2seq):
- Masked self-attention
- Cross-attention with encoder outputs
- FFN block with residuals/LayerNorm
- Classifier head (for classification) or linear projection to vocab (for generation)
Optimizer: AdamW
Scheduler: Cosine or warmup + decay
Loss: CrossEntropy (label smoothing optional)
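The optimizer/scheduler/loss combination above can be wired up as follows; the learning rate, weight decay, warmup length, smoothing factor, and pad id are illustrative assumptions, not values from the repo:

```python
import math
import torch
import torch.nn as nn

def make_training_setup(model: nn.Module, lr: float = 3e-4,
                        warmup_steps: int = 1000, total_steps: int = 10000):
    """AdamW + linear warmup followed by cosine decay + label-smoothed CE.
    All hyperparameters here are assumed defaults for illustration."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)                 # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))      # cosine decay to 0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    # ignore_index assumes pad token id 0 so padding never contributes to the loss
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=0)
    return optimizer, scheduler, criterion
```

Call `scheduler.step()` once per optimizer step (not per epoch) so the warmup counts gradient updates.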
torch==2.4.1
numpy==2.1.3
pandas==2.2.3
matplotlib==3.9.3
seaborn==0.13.2
scikit-learn==1.5.2
tqdm==4.66.5

- Proper masking (pad & causal) is crucial for stable training
- Warmup + cosine schedule helps initial convergence
- LayerNorm placement (Pre-Norm) can improve gradient flow on deeper stacks
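The pad & causal masks mentioned in the first note can be built together; a minimal sketch, assuming pad token id 0 and the convention that 1 = attend, 0 = blocked:

```python
import torch

def make_masks(src: torch.Tensor, tgt: torch.Tensor, pad_id: int = 0):
    """Build the encoder padding mask and the decoder pad+causal mask.

    Shapes are broadcastable against attention scores of shape
    (batch, heads, query_len, key_len). `pad_id` is an assumption.
    """
    # encoder mask: hide padded source positions
    src_mask = (src != pad_id).unsqueeze(1).unsqueeze(2)        # (B, 1, 1, S)

    # decoder mask: hide padded targets AND all future positions
    T = tgt.size(1)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))     # lower triangle
    tgt_pad = (tgt != pad_id).unsqueeze(1).unsqueeze(2)         # (B, 1, 1, T)
    tgt_mask = tgt_pad & causal                                 # (B, 1, T, T)
    return src_mask, tgt_mask
```

Combining the two masks with `&` means a decoder position is attendable only if it is both non-pad and not in the future, which is what keeps teacher-forced training from leaking targets.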
The dataset is included in the repository.
All results, including train, validation, and test metrics and correlation analyses, are in the notebook.