Transformer from Scratch (PyTorch)

A faithful “from-scratch” implementation of the Transformer architecture:

  • Token/positional embeddings
  • Multi-Head Self-Attention
  • Encoder/Decoder stacks with residual connections + LayerNorm
  • Position-wise feed-forward networks
  • Teacher forcing / masking for decoder inputs (if seq2seq)
  • Clear training pipeline with early stopping & LR scheduling
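The early-stopping logic mentioned above can be sketched as a small helper. This is an illustrative sketch, not the repo's actual code; the function name, the `patience` default, and the loss values in the usage note are assumptions.

```python
def train_with_early_stopping(val_losses, patience=3):
    """Stop when validation loss fails to improve for `patience` consecutive epochs.

    `val_losses` stands in for per-epoch validation losses produced by the
    real training loop. Returns (last epoch run, best validation loss).
    """
    best, bad = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, bad = loss, 0       # improvement: reset the patience counter
        else:
            bad += 1                  # no improvement this epoch
            if bad >= patience:
                return epoch, best    # stopped early
    return len(val_losses) - 1, best
```

For example, with losses `[1.0, 0.9, 0.95, 0.96, 0.97]` and `patience=3`, training stops at epoch 4 with a best loss of 0.9.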

🧠 Architecture

  • Embedding(d_model) + sinusoidal positional encoding
  • N encoder layers:
    • Multi-Head Self-Attention (Q/K/V projections, scaled dot-product, softmax)
    • Residual + LayerNorm
    • FFN: Linear → ReLU → Linear
  • N decoder layers (if seq2seq):
    • Masked self-attention
    • Cross-attention with encoder outputs
    • FFN block with residuals/LayerNorm
  • Classifier head (for classification) or linear projection to vocab (for generation)
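Two core pieces of the stack above, sinusoidal positional encoding and scaled dot-product attention, can be sketched roughly as follows. Function names and the mask convention (`True`/nonzero = attend) are illustrative assumptions, not the repo's actual API.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sin/cos positional encoding (assumes even d_model)."""
    position = torch.arange(max_len).unsqueeze(1)                       # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                        # even dims
    pe[:, 1::2] = torch.cos(position * div_term)                        # odd dims
    return pe

def scaled_dot_product_attention(q, k, v, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional boolean attend-mask."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)                   # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))           # block disallowed positions
    weights = torch.softmax(scores, dim=-1)
    return weights @ v
```

In multi-head attention, this same operation runs per head on Q/K/V projections of the input, and the head outputs are concatenated and projected back to `d_model`.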

Optimizer: AdamW
Scheduler: Cosine or warmup + decay
Loss: CrossEntropy (label smoothing optional)
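One plausible wiring of this optimizer / scheduler / loss combination in PyTorch is sketched below. The learning rate, warmup length, total steps, smoothing factor, and the stand-in model are all assumptions, not values taken from the repo.

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 1000)  # stand-in for the Transformer (assumption)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# Linear warmup for `warmup_steps`, then cosine decay for the rest of training.
warmup_steps, total_steps = 1000, 10000
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, total_iters=warmup_steps)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps - warmup_steps)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, [warmup, cosine], milestones=[warmup_steps])

# CrossEntropy with optional label smoothing; ignore_index skips padding targets.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1, ignore_index=0)
```

Calling `scheduler.step()` once per optimizer step ramps the learning rate up during warmup and then anneals it along a cosine curve.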


🧾 requirements.txt

torch==2.4.1
numpy==2.1.3
pandas==2.2.3
matplotlib==3.9.3
seaborn==0.13.2
scikit-learn==1.5.2
tqdm==4.66.5

📌 Insights

  • Proper masking (pad & causal) is crucial for stable training
  • Warmup + cosine schedule helps initial convergence
  • LayerNorm placement (Pre-Norm) can improve gradient flow in deeper stacks
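To illustrate the first point, a combined padding + causal mask for decoder self-attention might look like the sketch below. The `pad_id` value and the returned convention (`True` = attend) are assumptions for illustration.

```python
import torch

def make_decoder_mask(tgt_tokens: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Combine a padding mask and a causal mask for decoder self-attention.

    Returns a boolean mask of shape (batch, 1, seq, seq) where True = attend;
    the extra singleton dim broadcasts across attention heads.
    """
    batch, seq = tgt_tokens.shape
    pad_mask = (tgt_tokens != pad_id).unsqueeze(1).unsqueeze(2)   # (batch, 1, 1, seq)
    causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))   # lower-triangular (seq, seq)
    return pad_mask & causal                                      # broadcasts to (batch, 1, seq, seq)
```

Without the causal half, the decoder would attend to future tokens; without the padding half, softmax weight leaks onto pad positions and destabilizes training.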

📁 Dataset

The dataset used for training and evaluation is included in this repository.


📊 Results

All results, including train, validation, and test metrics as well as correlation analyses, are in the notebook.

About

From-scratch Transformer in PyTorch — embeddings, positional encoding, multi-head attention, encoder/decoder stacks, and a clean training pipeline for sequence tasks.
