Run quantized large language models on your CPU at production speed.
Hand-written SIMD kernels. Zero GPU required. Pure Rust.
RAI isn't a wrapper around PyTorch or GGML. Every critical path is hand-written from scratch in Rust:
| Feature | Details |
|---|---|
| W4A8 SIMD GEMM | 2,000+ lines of hand-tuned AVX2+FMA+F16C assembly. PMADDUBSW zero-port5 inner loop, Schraudolph fast-exp, L1 prefetching |
| 4-bit Quantized Inference | Native W4A32 and W4A8 matrix multiplication — no dequantization to f32 |
| Full Transformer Stack | RMSNorm, RoPE, GQA Attention, SwiGLU MLP — all SIMD-accelerated |
| Speculative Decoding | Both standard and self-speculative (layer-skipping) modes |
| GPTQ Compression | Built-in pipeline: calibrate → quantize → pack → run |
| Zero Dependencies on GPU | Runs on any x86_64 CPU with AVX2. No CUDA, no ROCm, no Metal |
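To make the 4-bit side concrete, here is a minimal sketch of nibble packing of the kind the GPTQ pipeline's pack step performs; the two-codes-per-byte, low-nibble-first layout is an assumption for illustration, not necessarily the layout `bitpack.rs` uses:

```rust
/// Pack 4-bit quantization codes (0..=15) two per byte, low nibble first.
/// Layout is illustrative; RAI's bitpack.rs may differ.
fn pack_nibbles(codes: &[u8]) -> Vec<u8> {
    codes
        .chunks(2)
        .map(|pair| (pair[0] & 0x0F) | ((pair.get(1).copied().unwrap_or(0) & 0x0F) << 4))
        .collect()
}

/// Inverse of `pack_nibbles`: expand each byte back into two 4-bit codes.
fn unpack_nibbles(packed: &[u8]) -> Vec<u8> {
    packed.iter().flat_map(|&b| [b & 0x0F, b >> 4]).collect()
}
```

Unpacking yields unsigned byte codes, which is exactly the form the W4A8 kernel consumes: the codes stay integer all the way into the multiply, so nothing is dequantized back to f32.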
RAI is organized as a Rust workspace with 4 crates:
```
rai/
├── rai-core/                Core types, embeddings, memory, reasoning
├── rai-infer/               Inference engine (the SIMD kernels live here)
│   ├── gemm.rs              W4A8 GEMM — AVX2 SIMD inner loops (2,000+ lines)
│   ├── layers.rs            RMSNorm, RoPE, GQA attention, SwiGLU MLP
│   ├── model.rs             Model loader (.raimodel format)
│   ├── kv_cache.rs          KV cache management
│   ├── sampler.rs           Temperature, top-k, top-p, min-p sampling (sketch below)
│   ├── format.rs            Binary model format parser
│   ├── speculative.rs       Standard speculative decoding
│   ├── self_speculative.rs  Self-speculative (layer-skipping) decoding
│   ├── ponder.rs            Adaptive compute / pondering
│   └── chat_template.rs     Chat template formatting
├── rai-compress/            Quantization & compression pipeline
│   ├── gptq.rs              GPTQ quantization algorithm
│   ├── compress.rs          Model compression orchestrator
│   ├── quantize.rs          Weight quantization utilities
│   ├── bitpack.rs           4-bit nibble packing
│   ├── hrc.rs               Hierarchical residual compression
│   ├── sac.rs               Structured adaptive compression
│   └── sparse.rs            Weight sparsification
└── rai-server/              API server with MCP support
```
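For orientation, here is a standalone sketch of the temperature, top-k, and top-p strategies listed for `sampler.rs` (min-p is omitted for brevity); the function name and signature are illustrative, not RAI's actual API:

```rust
/// Sample a token index from raw logits. `u` is a uniform random draw in
/// [0, 1) supplied by the caller, which keeps the sketch free of an RNG
/// dependency. Illustrative only; not the signature used in sampler.rs.
fn sample(logits: &[f32], temperature: f32, top_k: usize, top_p: f32, u: f32) -> usize {
    // 1. Temperature: divide logits to sharpen (<1) or flatten (>1) the distribution.
    let mut scored: Vec<(usize, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &l)| (i, l / temperature.max(1e-6)))
        .collect();

    // 2. Top-k: keep only the k highest-scoring tokens.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(top_k.max(1));

    // Softmax over the surviving logits.
    let max = scored[0].1;
    let mut probs: Vec<f32> = scored.iter().map(|&(_, l)| (l - max).exp()).collect();
    let sum: f32 = probs.iter().sum();
    for p in &mut probs {
        *p /= sum;
    }

    // 3. Top-p (nucleus): cut the tail once cumulative probability reaches p.
    let mut cum = 0.0;
    let mut cutoff = probs.len();
    for (i, &p) in probs.iter().enumerate() {
        cum += p;
        if cum >= top_p {
            cutoff = i + 1;
            break;
        }
    }
    probs.truncate(cutoff);

    // Renormalize and draw from the truncated distribution.
    let sum: f32 = probs.iter().sum();
    let mut acc = 0.0;
    for (i, &p) in probs.iter().enumerate() {
        acc += p / sum;
        if u < acc {
            return scored[i].0;
        }
    }
    scored[cutoff - 1].0
}
```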
The GEMM kernel is the heart of RAI. Here's what makes it fast:
```
Input (f32) ───→ Per-group INT8 quantization ──→ Even/Odd split ──┐
                                                                  │
4-bit weights ─→ Nibble unpack ───────────────────────────────────┤
                                                                  ▼
                                                   PMADDUBSW (256-bit AVX2)
                                                   32 multiply-adds per clock
                                                                  │
                                                   PMADDWD (pair reduction)
                                                                  │
                                                   Float accumulate + scale
                                                                  │
                                                   Horizontal sum ──→ Output
```
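As a concrete reference for the first stage of the diagram, here is a minimal sketch of per-group symmetric INT8 activation quantization; the group size of 32 and the rounding policy are assumptions for illustration, not taken from gemm.rs:

```rust
/// Quantize one group of 32 f32 activations to symmetric INT8 with a single
/// per-group scale. Sketch only; group size and rounding policy are assumed.
fn quantize_group_i8(x: &[f32; 32]) -> ([i8; 32], f32) {
    // Choose the scale so the largest magnitude maps to ±127.
    let max_abs = x.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let mut q = [0i8; 32];
    for (qi, &v) in q.iter_mut().zip(x.iter()) {
        *qi = (v / scale).round().clamp(-127.0, 127.0) as i8;
    }
    (q, scale) // the scale is folded back in during the float-accumulate stage
}
```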
Key optimizations:
- Zero port-5 inner loop — Even/odd column split eliminates all shuffle instructions
- PMADDUBSW — 32 simultaneous multiply-adds per 256-bit instruction
- Hardware F16C — `_mm_cvtph_ps` for scale/zero conversion
- Fused projections — QKV and gate+up share input preprocessing via `rayon::join`
- Adaptive prefetching — 3-row lookahead for small matrices, 1-row for large
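To show the shape of that inner loop, here is a hedged sketch of one quantization group of the W4A8 dot product using the intrinsics named above (PMADDUBSW, then PMADDWD); the data layout, group size, and zero-point handling are assumptions, and the real gemm.rs is considerably more involved:

```rust
use core::arch::x86_64::*;

/// Dot product of one 32-element quantization group: unsigned 4-bit weight
/// codes (already unpacked to bytes) against signed INT8 activations.
/// Sketch only; layout, group size, and zero-point handling are assumed.
#[target_feature(enable = "avx2")]
unsafe fn dot_group_w4a8(
    w_nibbles: &[u8; 32], // weight codes 0..=15, one per byte
    x_i8: &[i8; 32],      // quantized activations for the same group
    w_scale: f32,         // per-group weight scale
    w_zero: f32,          // per-group weight zero point
    x_scale: f32,         // per-group activation scale
    x_sum: i32,           // precomputed sum of the 32 activations
) -> f32 {
    let w = _mm256_loadu_si256(w_nibbles.as_ptr() as *const __m256i);
    let x = _mm256_loadu_si256(x_i8.as_ptr() as *const __m256i);

    // PMADDUBSW: u8 * i8 products, adjacent pairs summed into 16 x i16.
    // 15 * 127 * 2 fits comfortably in i16, so saturation cannot occur.
    let p16 = _mm256_maddubs_epi16(w, x);
    // PMADDWD against ones: reduce adjacent i16 pairs into 8 x i32.
    let p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1));

    // Horizontal sum of the 8 i32 lanes.
    let lo = _mm256_castsi256_si128(p32);
    let hi = _mm256_extracti128_si256::<1>(p32);
    let s = _mm_add_epi32(lo, hi);
    let s = _mm_hadd_epi32(s, s);
    let s = _mm_hadd_epi32(s, s);
    let idot = _mm_cvtsi128_si32(s) as f32;

    // Undo the weight zero point, then apply both per-group scales:
    //   sum_i (w_i - zero) * x_i * w_scale * x_scale
    (idot - w_zero * x_sum as f32) * w_scale * x_scale
}
```

In the real kernel, the even/odd column split is what lets both nibbles of each packed byte be consumed without any shuffle instructions, keeping port 5 idle; the sketch above sidesteps that by assuming the nibbles are already unpacked.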
Build, quantize a model, and run:

```bash
cargo build --release

# Quantize a HuggingFace model to .raimodel format
cargo run --release -p rai-compress --bin compress -- \
    --model path/to/model \
    --output model.raimodel \
    --bits 4

# Interactive chat
cargo run --release -p rai-infer --bin rai-chat -- \
    --model model.raimodel

# Text generation
cargo run --release -p rai-infer --bin rai-generate -- \
    --model model.raimodel \
    --prompt "The future of computing is"

# Start the API server
cargo run --release -p rai-server
```

| Requirement | Details |
|---|---|
| Rust | 1.70+ (edition 2021) |
| CPU | x86_64 with AVX2 + FMA + F16C (Intel Haswell+ / AMD Zen+) |
| RAM | Depends on model size (4-bit: ~4GB for 7B params) |
| OS | Linux, macOS, Windows (WSL2) |
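The RAM figure is straightforward arithmetic: 7B parameters at 4 bits is roughly 7 × 10⁹ × 0.5 bytes ≈ 3.5 GB of packed weights, with per-group scales and zeros, the KV cache, and activations accounting for the rest of the ~4 GB.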
| Binary | Description |
|---|---|
| `rai-generate` | Text generation CLI |
| `rai-chat` | Interactive chat interface |
| `profile-fwd` | Forward pass profiler (layer-by-layer timing) |
| `bw-bench` | Memory bandwidth benchmark |
MIT License — see LICENSE for details.
Built from scratch in Rust • Every SIMD instruction placed by hand • No GPU required