tensor_parallel_toy

A toy implementation of Tensor Parallel (TP) + Sequence Parallel (SP) with communication-computation overlap for educational purposes.

Companion blog post: Tensor Parallel + Sequence Parallel — A Deep Dive

Based on the original implementation by @xiabingquan.

Quick Start

# Run unit tests (needs >= 4 GPUs)
python -m tensor_parallel_toy test

# Run profiling (no TP / TP / TP+overlap, needs >= 4 GPUs)
python -m tensor_parallel_toy profile

# Or via shell scripts
bash run_tests.sh
bash run_profile.sh

Code Structure

| File | Description |
| --- | --- |
| `config.py` | `ModelConfig` dataclass |
| `initialize.py` | Distributed init, global seed, CPU weight init |
| `parallel_linear.py` | `ColumnParallelLinear`, `RowParallelLinear`, overlap variants, comm helpers |
| `model.py` | `RMSNorm`, `Attention`, `MLP`, `TransformerBlock`, `Transformer` |
| `utils.py` | Shared distributed testing utilities |
| `test_tp.py` | Unit tests (TP vs no-TP, overlap vs no-overlap) |
| `profile_memory.py` | Training loop profiling (loss, grad norm, step time, peak memory) |
| `__init__.py` | Package exports |
| `__main__.py` | CLI entry point |

Key Implementation Details

TP Linear: Each parallel layer uses a custom `autograd.Function`. The forward pass saves the AllGathered input so that the backward pass can reuse it. A `use_overlap` flag switches between the basic and overlap autograd Functions within the same Module.
Overlap AG+GEMM: Ring P2P exchange (`dist.batch_isend_irecv`) pipelined with the GEMM on two CUDA streams.
Overlap GEMM+RS: Chunked GEMM with a per-chunk `dist.reduce(dst=i)`, so that chunk i is reduced onto rank i. `reduce_scatter` is not used here because it would scatter at the wrong granularity.
RMSNorm grads: The RMSNorm weights are replicated across ranks, so their gradients need an AllReduce after backward: each rank sees only s/n of the s sequence positions (n ranks) and therefore holds only a partial gradient.
Weight init: A global `torch.manual_seed` makes all ranks draw from the same RNG in the same order, so the sharded weights are consistent with the full model.
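
The last two points can be checked on a single CPU process without any distributed setup. The sketch below is not taken from this repository: `rmsnorm`, `full_weight_grad`, `sharded_weight_grads`, and `column_shard` are hypothetical helpers written for illustration, and slicing the full weight along its output dimension is an assumption about how a column-parallel shard might be derived. It shows that summing the per-rank RMSNorm weight gradients (what an AllReduce with a sum would compute) recovers the full-sequence gradient, and that shards drawn under a shared seed concatenate back to the full weight.

```python
import torch


def rmsnorm(x, weight, eps=1e-6):
    # x: (seq, hidden). Normalize each position over the hidden dimension.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight


def full_weight_grad(x, weight):
    # Gradient of a toy loss (sum of outputs) w.r.t. the weight,
    # computed over the whole sequence on one "rank".
    w = weight.clone().requires_grad_(True)
    rmsnorm(x, w).sum().backward()
    return w.grad


def sharded_weight_grads(x, weight, world_size):
    # Simulate sequence parallelism: each rank sees seq / world_size positions
    # and holds its own replica of the weight, so it gets a partial gradient.
    grads = []
    for x_local in x.chunk(world_size, dim=0):
        w = weight.clone().requires_grad_(True)
        rmsnorm(x_local, w).sum().backward()
        grads.append(w.grad)
    return grads


def column_shard(seed, out_features, in_features, rank, world_size):
    # Assumption: every rank seeds identically, materializes the full weight,
    # and keeps only its slice of the output dimension.
    torch.manual_seed(seed)
    full = torch.randn(out_features, in_features)
    return full.chunk(world_size, dim=0)[rank]


if __name__ == "__main__":
    torch.manual_seed(0)
    world_size = 4
    x = torch.randn(8, 16, dtype=torch.float64)
    weight = torch.randn(16, dtype=torch.float64)

    # Summing the partial gradients plays the role of the AllReduce.
    partial = sharded_weight_grads(x, weight, world_size)
    reduced = torch.stack(partial).sum(dim=0)
    print("summed partial grads match full grad:",
          torch.allclose(reduced, full_weight_grad(x, weight)))

    # Shards drawn under the same seed reassemble into the full weight.
    shards = [column_shard(1234, 8, 4, r, world_size) for r in range(world_size)]
    torch.manual_seed(1234)
    print("shards reassemble to full weight:",
          torch.equal(torch.cat(shards, dim=0), torch.randn(8, 4)))
```

The gradient check relies on the toy loss being a sum over positions, so the per-rank losses add up to the full loss. A real training loss would need the same property (or an appropriate scaling) for a plain sum-AllReduce to be correct.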

Requirements

Python >= 3.8
PyTorch >= 2.1
Multiple CUDA GPUs (>= 4 for unit tests and profiling)
