A high-performance implementation of OpenAI's Whisper model (Tiny version) written entirely in Mojo 🔥.
This project brings the power of OpenAI's Whisper to the Mojo programming language. By implementing the architecture from the ground up, we leverage Mojo's unique ability to combine Python-like syntax with C-level performance through hardware acceleration, SIMD, and low-level memory control.
> **Note**
> This implementation currently supports Whisper-Tiny with greedy decoding for English transcription.
- **Pure Mojo Implementation**: Every layer (Encoder, Decoder, Multi-Head Attention) is written in Mojo.
- **Ultra-Fast Inference**: Uses KV-caching for incremental decoding, reducing per-step complexity from $O(L^2)$ to $O(L)$.
- **Multi-core Parallelization**: Attention heads and tensor operations are parallelized using Mojo's `parallelize` algorithm.
- **SIMD Acceleration**: Core math operations (matmul, LayerNorm, GeLU) are vectorized using Mojo's SIMD primitives.
- **Real-world Audio**: Integrated pipeline to process real audio files (MP3/WAV) into Mel spectrograms.
- **Bit-Perfect Tokenization**: Fully compatible with OpenAI's tokenizer, producing identical results to the PyTorch reference implementation.
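The KV-cache idea behind the inference speedup can be illustrated outside of Mojo. The sketch below is plain Python/NumPy, not the project's Mojo code: the `KVCache` class and identity-projection stand-ins are hypothetical simplifications (a real decoder would apply learned `W_k`/`W_v` projections), but the core point holds — each decoding step projects only the newest token and attends over the cached history, instead of recomputing keys and values for the whole sequence.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention for a single head.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

class KVCache:
    """Grow-only cache of keys/values for incremental decoding (sketch)."""
    def __init__(self, d_model):
        self.k = np.empty((0, d_model))
        self.v = np.empty((0, d_model))

    def append(self, k_new, v_new):
        self.k = np.vstack([self.k, k_new])
        self.v = np.vstack([self.v, v_new])
        return self.k, self.v

# Without a cache, step t recomputes K/V for all t tokens: O(L^2) work per step.
# With the cache, each step projects only the newest token: O(L) work per step.
cache = KVCache(d_model=4)
rng = np.random.default_rng(0)
for step in range(3):
    x_new = rng.standard_normal((1, 4))  # embedding of the newest token
    k, v = cache.append(x_new, x_new)    # stand-ins for W_k @ x, W_v @ x
    out = attention(x_new, k, v)         # attend over all cached tokens
```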
| File | Description |
|---|---|
| `main.mojo` | The entry point. Orchestrates weight loading, audio processing, and transcription. |
| `whisper.mojo` | The "brain". Contains the Whisper model and incremental decoding logic. |
| `layers.mojo` | Core building blocks with KVCache support and parallelized attention. |
| `whisper_tensor.mojo` | Mathematical foundation. Implements parallelized & SIMD-optimized tensor ops. |
| `tokenizer.mojo` | Decodes numeric tokens into human-readable text. |
| `loader.mojo` | Efficient binary weight loader. |
| `export_weights.py` | Python bridge for weight export and audio preprocessing. |
- Mojo SDK (v24.5+)
- Python environment with `torch`, `transformers`, `soundfile`, `scipy`, `requests`
1. **Clone & Setup**

   ```shell
   git clone https://github.com/antonvice/whisper.Mojo.git
   cd whisper.Mojo
   ```

2. **Prepare Weights & Audio**

   ```shell
   uv run export_weights.py
   ```

3. **Build & Run (Recommended for Speed)** — for the best performance, compile to a native binary:

   ```shell
   mojo build main.mojo
   ./main
   ```
This implementation is designed to showcase Mojo's performance advantages:
- KV-Cache: Instead of re-computing the entire sequence for every new token, we cache the keys and values of previous tokens.
- Parallel Heads: All attention heads in a layer are processed simultaneously on multiple CPU cores.
- SIMD Vectorization: Inner loops are manually tuned to use 256-bit or 512-bit registers (depending on hardware).
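The head-level parallelism described above can be sketched in plain Python/NumPy (not the project's Mojo code). The `multi_head_attention` function below is a hypothetical simplification that uses identity projections instead of learned Q/K/V weights; what it shows is the structural property the README relies on — each head reads and writes a disjoint slice of the model dimension, so the per-head loop has no cross-iteration dependencies and can be handed to Mojo's `parallelize` with one head per core.

```python
import numpy as np

def multi_head_attention(x, n_heads):
    """Each head attends over an independent slice of the model dimension,
    so the head loop can be dispatched across CPU cores."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    out = np.empty_like(x)
    for h in range(n_heads):  # in the Mojo version this loop runs in parallel
        s = slice(h * d_head, (h + 1) * d_head)
        q = k = v = x[:, s]   # identity projections, for the sketch only
        scores = q @ k.T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        out[:, s] = w @ v     # writes only this head's slice: no data races
    return out

x = np.random.default_rng(1).standard_normal((5, 8))
y = multi_head_attention(x, n_heads=4)
```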
```
Initializing Whisper Tiny in Mojo...
Loading weights from whisper_tiny_weights.bin...

Transcription:
--------------------
This is my voice on the left. This is my voice on the left hand side...
--------------------
```
- **Beat Python Baseline**: Achieved a total transcription time of 0.74s, outperforming the Python/PyTorch reference implementation (0.78s).
- **Register-Heavy Decoder**: Implemented a peak-performance decoder attention path using register-cached heads. Optimized decoding by switching to serial head loops, eliminating thread-pool overhead for small tasks.
- **Contiguous Encoder Pipeline**: Redesigned the encoder to use transposed-output convolutions. Implemented blocked parallelization (grain size 16) in `conv1d` to eliminate cache-line contention (false sharing).
- **Parallel Logit Projection**: Optimized the final $1 \times 51k$ output layer to parallelize across all cores, maximizing throughput for small batch sizes.
- **MAX Engine Integration**: Leveraged Modular's MAX Engine specialized matmul kernels for all encoder Transformer blocks.
- **Warning Cleanup**: Resolved compiler warnings in `Tensor.__moveinit__` while ensuring proper move semantics.
- **Advanced `conv1d` Vectorization**: Implemented a "Transpose-DotProduct" strategy for 1D convolutions, enabling full SIMD utilization. Optimized core Whisper filters (K=3) with manual unrolling and hoisting of accumulation logic.
- **Matrix-Matrix Matmul Tiling**: Enhanced the matrix multiplication kernel with 8x tiling and unrolling along the $N$ dimension. This significantly reduced memory pressure and improved throughput for large encoder blocks ($M = 1500$).
- **Optimized Prefill (Attention)**: Optimized the prefill/encoder path by switching from manual scalar loops to high-performance `matmul`-based head processing. Added parallelized extraction and scatter of attention heads.
- **Layout-Aware Weight Loading**: Integrated pre-transposition of convolutional weights during model loading to ensure optimal memory layout for inference.
- **Robust SIMD Kernels**: Implemented generalized tail-handling in `matmul` and `conv1d`, ensuring stability across arbitrary sequence lengths and filter sizes.
- **Benchmark Results**: Reduced total transcription time to ~1.59s (from ~3.3s), a 2x overall speedup; encoder runtime reduced by over 35%.
- **Optimized Matmul**: Implemented dynamic parallelization that adapts to matrix shapes. Added 1D tiling for better cache reuse and switched to hardware-native SIMD widths using `simdwidthof`.
- **Vectorized Attention**: Fully vectorized the inner loops of `MultiHeadAttention`, accelerating both the dot-product score calculation and the weighted value sum.
- **Optimized Tensor Primitives**: Vectorized `LayerNorm`, `Softmax`, and `GeLU` operations. Added safe tail-handling for sequence lengths that are not a multiple of the SIMD width.
- **Fast Memory Operations**: Replaced slow scalar loops in KV-cache management with high-performance `memcpy` transfers.
- **Threading Improvements**: Optimized thread distribution in decoder layers to ensure all CPU cores are utilized during incremental decoding (single-token generation).
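The tiling-plus-tail-handling pattern mentioned in several entries above can be illustrated outside of Mojo. The sketch below is plain Python/NumPy, not the project's kernel: `tiled_matmul` and its `tile` parameter are hypothetical names, and NumPy's `@` stands in for the hand-vectorized inner loop. What it demonstrates is the structure, i.e. blocking the output along the $N$ dimension in fixed-width tiles while the final, possibly narrower tile is handled explicitly, the same way the Mojo kernels handle sequence lengths that are not a multiple of the SIMD width.

```python
import numpy as np

def tiled_matmul(a, b, tile=8):
    """Blocked matmul over the N (column) dimension.

    The trailing partial tile mirrors the 'generalized tail-handling'
    needed when N is not a multiple of the tile / SIMD width."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n))
    for j0 in range(0, n, tile):
        j1 = min(j0 + tile, n)        # tail tile may be narrower than `tile`
        c[:, j0:j1] = a @ b[:, j0:j1]  # one full-width-or-tail tile at a time
    return c

a = np.random.default_rng(2).standard_normal((4, 6))
b = np.random.default_rng(3).standard_normal((6, 13))  # 13 is not a multiple of 8
c = tiled_matmul(a, b)
```

In the Mojo kernels the per-tile work is a SIMD inner loop rather than a NumPy call, but the blocking logic and the explicit tail branch are the same.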
This project is licensed under the MIT License - see the LICENSE file for details.