๐ŸŽ™๏ธ Whisper.Mojo

A high-performance implementation of OpenAI's Whisper model (Tiny version) written entirely in Mojo 🔥.

🚀 Overview

This project brings the power of OpenAI's Whisper to the Mojo programming language. By implementing the architecture from the ground up, we leverage Mojo's unique ability to combine Python-like syntax with C-level performance through hardware acceleration, SIMD, and low-level memory control.

Note

This implementation currently supports Whisper-Tiny with greedy decoding for English transcription.

✨ Features

  • 🎯 Pure Mojo Implementation: Every layer (Encoder, Decoder, Multi-Head Attention) is written in Mojo.
  • 🚀 Ultra-Fast Inference: Uses KV-caching for incremental decoding, reducing the per-token attention cost from $O(L^2)$ to $O(L)$.
  • 🧵 Multi-core Parallelization: Attention heads and tensor operations are parallelized using Mojo's parallelize algorithm.
  • ⚡ SIMD Acceleration: Core math operations (matmul, LayerNorm, GeLU) are vectorized using Mojo's SIMD primitives.
  • 🎧 Real-world Audio: Integrated pipeline that converts real audio files (MP3/WAV) into Mel spectrograms.
  • 🔍 Bit-Perfect Tokenization: Fully compatible with OpenAI's tokenizer, producing output identical to the PyTorch reference implementation.

📂 Project Structure

  • main.mojo: 🎮 The entry point. Orchestrates weight loading, audio processing, and transcription.
  • whisper.mojo: 🧠 The "Brain". Contains the Whisper model and incremental decoding logic.
  • layers.mojo: 🧱 Core building blocks with KVCache support and parallelized attention.
  • whisper_tensor.mojo: 🧬 Mathematical foundation. Implements parallelized & SIMD-optimized tensor ops.
  • tokenizer.mojo: 🔤 Decodes numeric tokens into human-readable text.
  • loader.mojo: 📥 Efficient binary weight loader.
  • export_weights.py: 🐍 Python bridge for weight export and audio preprocessing.

๐Ÿ› ๏ธ Getting Started

📋 Prerequisites

  • Mojo SDK (v24.5+)
  • Python Environment with torch, transformers, soundfile, scipy, requests

๐Ÿ—๏ธ Installation & Execution

  1. Clone & Setup

    git clone https://github.com/antonvice/whisper.Mojo.git
    cd whisper.Mojo
  2. Prepare Weights & Audio

    uv run export_weights.py
  3. Build & Run (Recommended for Speed)

     For the best performance, compile to a native binary:

    mojo build main.mojo
    ./main

📊 Optimization Details

This implementation is designed to showcase Mojo's performance advantages:

  1. KV-Cache: Instead of re-computing the entire sequence for every new token, we cache the keys and values of previous tokens.
  2. Parallel Heads: All attention heads in a layer are processed simultaneously on multiple CPU cores.
  3. SIMD Vectorization: Inner loops are manually tuned to use 256-bit or 512-bit registers (depending on hardware).
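The KV-cache idea in step 1 can be illustrated with a minimal single-head attention sketch in plain Python (toy dimensions, lists instead of real tensors; not the project's actual API). Each decoding step appends one new key/value pair instead of recomputing them for the whole sequence:

```python
import math

def attend(q, keys, values):
    """One query attending over cached keys/values (single head, toy dims)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]      # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Incremental decoding: each step adds ONE new k/v pair to the cache,
# so per-token attention work grows as O(L), not O(L^2).
k_cache, v_cache = [], []
for step in range(3):
    k_new = [float(step), 1.0]   # stand-in for W_k @ x_t
    v_new = [1.0, float(step)]   # stand-in for W_v @ x_t
    k_cache.append(k_new)
    v_cache.append(v_new)
    out = attend([1.0, 0.0], k_cache, v_cache)
print(len(k_cache))  # 3 cached entries after 3 steps
```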

๐Ÿ“ Example Output

Initializing Whisper Tiny in Mojo...
Loading weights from whisper_tiny_weights.bin...
Transcription:
--------------------
 This is my voice on the left. This is my voice on the left hand side...
--------------------

📈 Changelog

[2025-12-29] - 🚀 Performance Breakthrough: Sub-Python Latency

  • ๐Ÿ Beat Python Baseline: Achieved a total transcription time of 0.74s, outperforming the Python/PyTorch reference implementation (0.78s).
  • ๐ŸŽ๏ธ Register-Heavy Decoder: Implemented a peak-performance decoder attention path using register-cached heads. Optimized decoding by switching to serial head loops, eliminating thread pool overhead for small tasks.
  • ๐Ÿ”„ Contiguous Encoder Pipeline: Redesigned the encoder to use transposed-output convolutions. Implemented Blocked Parallelization (grain size 16) in conv1d to eliminate cache-line contention (false sharing).
  • ๐Ÿงต Parallel Logit Projection: Optimized the final $1 \times 51k$ output layer to parallelize across all cores, maximizing throughput for small batch sizes.
  • ๐Ÿงฑ MAX Engine Integration: Leveraged Modular's MAX Engine specialized matmul kernels for all encoder Transformer blocks.
  • ๐Ÿ›ก๏ธ Warning Cleanup: Resolved compiler warnings in Tensor.__moveinit__ while ensuring proper move semantics.

[2025-12-26] - Performance Optimization Sprint (Part 2)

  • 🚀 Advanced conv1d Vectorization: Implemented a "transpose-dot-product" strategy for 1D convolutions, enabling full SIMD utilization. Optimized core Whisper filters (K=3) with manual unrolling and hoisting of accumulation logic.
  • ⚡ Matrix-Matrix Matmul Tiling: Enhanced the matrix multiplication kernel with 8x tiling and unrolling along the $N$ dimension. This significantly reduced memory pressure and improved throughput for large encoder blocks ($M=1500$).
  • 🧬 Optimized Prefill (Attention): Optimized the prefill/encoder path by switching from manual scalar loops to high-performance matmul-based head processing. Added parallelized extraction and scatter of attention heads.
  • 💾 Layout-Aware Weight Loading: Convolutional weights are now pre-transposed during model loading to ensure an optimal memory layout for inference.
  • 🛡️ Robust SIMD Kernels: Implemented generalized tail handling in matmul and conv1d, ensuring stability across arbitrary sequence lengths and filter sizes.
  • 📈 Benchmark Results: Reduced total transcription time to ~1.59s (from ~3.3s), a 2x overall speedup; encoder runtime dropped by over 35%.
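The tiling in the second bullet keeps a small block of output columns hot while sweeping over the inner dimension. A scalar Python sketch of the same loop structure (the real kernel uses SIMD registers and unrolling, not Python lists):

```python
def matmul_tiled(a, b, tile=8):
    """C = A @ B with the N (output-column) dimension processed in tiles."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for j0 in range(0, n, tile):
        j1 = min(j0 + tile, n)          # generalized tail handling
        for i in range(m):
            row_a, row_c = a[i], c[i]
            for p in range(k):
                aip = row_a[p]
                row_b = b[p]
                # Inner loop touches only columns j0..j1, improving locality.
                for j in range(j0, j1):
                    row_c[j] += aip * row_b[j]
    return c

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
print(matmul_tiled(a, b))  # [[19.0, 22.0], [43.0, 50.0]]
```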

[2025-12-25] - Performance Optimization Sprint (Part 1)

  • 🚀 Optimized Matmul: Implemented dynamic parallelization that adapts to matrix shapes. Added 1D tiling for better cache reuse and switched to hardware-native SIMD widths using simdwidthof.
  • ⚡ Vectorized Attention: Fully vectorized the inner loops of MultiHeadAttention, accelerating both the dot-product score calculation and the weighted value sum.
  • 🧬 Optimized Tensor Primitives: Vectorized the LayerNorm, Softmax, and GeLU operations. Added safe tail handling for sequence lengths that are not a multiple of the SIMD width.
  • 💾 Fast Memory Operations: Replaced slow scalar loops in KV-cache management with high-performance memcpy transfers.
  • 🧵 Threading Improvements: Optimized thread distribution in decoder layers so that all CPU cores are utilized during incremental decoding (single-token generation).
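For reference, the GeLU mentioned above is commonly computed with the tanh approximation used by GPT/Whisper-style models. A scalar Python version of that formula (the Mojo kernel applies the same math across SIMD lanes; this is a sketch, not the project's code):

```python
import math

def gelu(x):
    # tanh approximation of GeLU: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

print(round(gelu(1.0), 4))  # 0.8412
```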

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.
