
⚡ RAI

CPU-Native LLM Inference Engine

Run quantized large language models on your CPU at production speed.
Hand-written SIMD kernels. Zero GPU required. Pure Rust.


🔥 What Makes RAI Different

RAI isn't a wrapper around PyTorch or GGML. Every critical path is hand-written from scratch in Rust:

| Feature | Details |
| --- | --- |
| W4A8 SIMD GEMM | 2,000+ lines of hand-tuned AVX2+FMA+F16C intrinsics: zero-port-5 PMADDUBSW inner loop, Schraudolph fast-exp, L1 prefetching |
| 4-bit Quantized Inference | Native W4A32 and W4A8 matrix multiplication, with no dequantization to f32 |
| Full Transformer Stack | RMSNorm, RoPE, GQA attention, SwiGLU MLP, all SIMD-accelerated |
| Speculative Decoding | Both standard and self-speculative (layer-skipping) modes; see the sketch below |
| GPTQ Compression | Built-in pipeline: calibrate → quantize → pack → run |
| Zero GPU Dependencies | Runs on any x86_64 CPU with AVX2. No CUDA, no ROCm, no Metal |
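
To make the Speculative Decoding row concrete, here is a minimal greedy draft-and-verify loop. The `Model` trait is hypothetical (RAI's real API will differ), and verification is shown token by token for clarity; production implementations, presumably including RAI's, verify the whole draft in one batched forward pass.

```rust
/// Hypothetical single-token interface; RAI's actual API will differ.
trait Model {
    /// Greedy next-token prediction given the context so far.
    fn next_token(&mut self, ctx: &[u32]) -> u32;
}

/// Draft `k` tokens with a cheap model, verify them with the target model,
/// and keep the longest agreeing prefix plus one token from the target.
fn speculative_step(
    draft: &mut dyn Model,
    target: &mut dyn Model,
    ctx: &mut Vec<u32>,
    k: usize,
) {
    let base = ctx.len();
    for _ in 0..k {
        let t = draft.next_token(ctx.as_slice());
        ctx.push(t);
    }
    for i in 0..k {
        // What the target would have produced at this position.
        let expected = target.next_token(&ctx[..base + i]);
        if ctx[base + i] != expected {
            // First disagreement: discard the rest of the draft
            // and take the target's token instead.
            ctx.truncate(base + i);
            ctx.push(expected);
            return;
        }
    }
    // Every draft token was accepted; the target's next token comes free.
    let bonus = target.next_token(ctx.as_slice());
    ctx.push(bonus);
}
```

The self-speculative mode drops the separate draft model: a layer-skipping pass over the same model produces the draft, so a single .raimodel serves both roles.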

🏗 Architecture

RAI is organized as a Rust workspace with 4 crates:

rai/
├── rai-core/          Core types, embeddings, memory, reasoning
├── rai-infer/         Inference engine (the SIMD kernels live here)
│   ├── gemm.rs        W4A8 GEMM — AVX2 SIMD inner loops (2,000+ lines)
│   ├── layers.rs      RMSNorm, RoPE, GQA attention, SwiGLU MLP
│   ├── model.rs       Model loader (.raimodel format)
│   ├── kv_cache.rs    KV cache management
│   ├── sampler.rs     Temperature, top-k, top-p, min-p sampling
│   ├── format.rs      Binary model format parser
│   ├── speculative.rs Standard speculative decoding
│   ├── self_speculative.rs  Self-speculative (layer-skipping) decoding
│   ├── ponder.rs      Adaptive compute / pondering
│   └── chat_template.rs  Chat template formatting
├── rai-compress/      Quantization & compression pipeline
│   ├── gptq.rs        GPTQ quantization algorithm
│   ├── compress.rs    Model compression orchestrator
│   ├── quantize.rs    Weight quantization utilities
│   ├── bitpack.rs     4-bit nibble packing
│   ├── hrc.rs         Hierarchical residual compression
│   ├── sac.rs         Structured adaptive compression
│   └── sparse.rs      Weight sparsification
└── rai-server/        API server with MCP support
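
For a sense of what bitpack.rs does, here is a minimal 4-bit nibble pack/unpack. The low-nibble-first byte layout is an assumption for this sketch; the real on-disk layout of .raimodel is whatever format.rs defines.

```rust
/// Pack pairs of 4-bit values (each 0..=15) into bytes, low nibble first.
/// NOTE: low-nibble-first is an assumption for this sketch, not
/// necessarily what rai-compress writes to .raimodel files.
fn pack_nibbles(vals: &[u8]) -> Vec<u8> {
    vals.chunks(2)
        .map(|pair| {
            let lo = pair[0] & 0x0F;
            let hi = pair.get(1).copied().unwrap_or(0) & 0x0F;
            lo | (hi << 4)
        })
        .collect()
}

/// Recover the two 4-bit values from one packed byte.
fn unpack_nibbles(byte: u8) -> (u8, u8) {
    (byte & 0x0F, byte >> 4)
}
```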

⚡ The Kernel

The GEMM kernel is the heart of RAI. Here's what makes it fast:

Input (f32) ──→ Per-group INT8 quantization ──→ Even/Odd split
                                                      │
4-bit weights ──→ Nibble unpack ──────────────────────┤
                                                      ▼
                                          PMADDUBSW (256-bit AVX2)
                                          32 multiply-adds per clock
                                                      │
                                          PMADDWD (pair reduction)
                                                      │
                                          Float accumulate + scale
                                                      │
                                          Horizontal sum ──→ Output

Key optimizations:

  • Zero port-5 inner loop: the even/odd column split eliminates all shuffle instructions
  • PMADDUBSW: 32 simultaneous multiply-adds per 256-bit instruction (sketched below)
  • Hardware F16C (_mm_cvtph_ps) for scale/zero-point conversion
  • Fused projections: QKV and gate+up share input preprocessing via rayon::join
  • Adaptive prefetching: 3-row lookahead for small matrices, 1-row for large
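
As a rough illustration (not RAI's actual kernel; the function shape, operand layout, and scale handling are assumptions), here is the PMADDUBSW → PMADDWD → FMA reduction pattern in stable core::arch intrinsics, together with the f32 form of Schraudolph's fast exp named in the feature table:

```rust
#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::*;

/// One inner-loop step: 32 unpacked 4-bit weights (u8 lanes, 0..=15)
/// times 32 signed INT8 activations, accumulated into 8 f32 lanes.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot32_w4a8(
    weights_u8: __m256i,
    acts_i8: __m256i,
    scale: __m256,
    acc: __m256,
) -> __m256 {
    // vpmaddubsw: u8 × i8 → 16 pairwise i16 sums, no shuffle ports touched.
    let prod16 = _mm256_maddubs_epi16(weights_u8, acts_i8);
    // vpmaddwd against ones: i16 pairs → 8 × i32 partial dot products.
    let prod32 = _mm256_madd_epi16(prod16, _mm256_set1_epi16(1));
    // Convert to f32 and fold in the per-group scale with a single FMA.
    _mm256_fmadd_ps(_mm256_cvtepi32_ps(prod32), scale, acc)
}

/// Schraudolph's fast exp (f32 variant): build the IEEE-754 bit pattern of
/// 2^(x / ln 2) directly. 12102203 ≈ 2^23 / ln 2, and 1064866805 is the
/// biased-exponent constant with an error-minimizing correction.
fn fast_exp(x: f32) -> f32 {
    let i = (12102203.0_f32 * x + 1064866805.0) as i32;
    f32::from_bits(i as u32)
}
```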

🚀 Quick Start

Build

cargo build --release

Convert a Model

# Quantize a HuggingFace model to .raimodel format
cargo run --release -p rai-compress --bin compress -- \
    --model path/to/model \
    --output model.raimodel \
    --bits 4

Run Inference

# Interactive chat
cargo run --release -p rai-infer --bin rai-chat -- \
    --model model.raimodel

# Text generation
cargo run --release -p rai-infer --bin rai-generate -- \
    --model model.raimodel \
    --prompt "The future of computing is"

Start API Server

cargo run --release -p rai-server

🛠 Requirements

| Requirement | Details |
| --- | --- |
| Rust | 1.70+ (edition 2021) |
| CPU | x86_64 with AVX2 + FMA + F16C (Intel Haswell or newer / AMD Zen or newer) |
| RAM | Depends on model size (4-bit: ~4 GB for 7B parameters) |
| OS | Linux, macOS, Windows (WSL2) |
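
To confirm a machine qualifies, Rust's standard library can detect these features at runtime; this is a generic check, not RAI's own startup code:

```rust
/// Returns true if the host CPU has every ISA extension the kernels need.
#[cfg(target_arch = "x86_64")]
fn cpu_supported() -> bool {
    // std's runtime feature detection; no external crates needed.
    is_x86_feature_detected!("avx2")
        && is_x86_feature_detected!("fma")
        && is_x86_feature_detected!("f16c")
}
```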

📊 Binaries

| Binary | Description |
| --- | --- |
| rai-generate | Text generation CLI |
| rai-chat | Interactive chat interface |
| profile-fwd | Forward-pass profiler (layer-by-layer timing) |
| bw-bench | Memory bandwidth benchmark |

📜 License

MIT License — see LICENSE for details.


Built from scratch in Rust • Every SIMD instruction placed by hand • No GPU required
