Run quantized large language models on your CPU at production speed.
Hand-written SIMD kernels. Zero GPU required. Pure Rust.
RAI isn't a wrapper around PyTorch or GGML. Every critical path is hand-written from scratch in Rust:
| Feature | Details |
|---|---|
| W4A8 SIMD GEMM | 2,000+ lines of hand-tuned AVX2+FMA+F16C assembly. PMADDUBSW zero-port5 inner loop, Schraudolph fast-exp, L1 prefetching |
| 4-bit Quantized Inference | Native W4A32 and W4A8 matrix multiplication — no dequantization to f32 |
| Full Transformer Stack | RMSNorm, RoPE, GQA Attention, SwiGLU MLP — all SIMD-accelerated |
| Speculative Decoding | Both standard and self-speculative (layer-skipping) modes |
| GPTQ Compression | Built-in pipeline: calibrate → quantize → pack → run |
| Zero Dependencies on GPU | Runs on any x86_64 CPU with AVX2. No CUDA, no ROCm, no Metal |
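To make the 4-bit side concrete, here is a minimal sketch of nibble packing of the kind the GPTQ pipeline's pack step performs; the two-codes-per-byte, low-nibble-first layout is an assumption for illustration, not necessarily the layout `bitpack.rs` uses:

```rust
/// Pack 4-bit quantization codes (0..=15) two per byte, low nibble first.
/// Layout is illustrative; RAI's bitpack.rs may differ.
fn pack_nibbles(codes: &[u8]) -> Vec<u8> {
    codes
        .chunks(2)
        .map(|pair| (pair[0] & 0x0F) | ((pair.get(1).copied().unwrap_or(0) & 0x0F) << 4))
        .collect()
}

/// Inverse of `pack_nibbles`: expand each byte back into two 4-bit codes.
fn unpack_nibbles(packed: &[u8]) -> Vec<u8> {
    packed.iter().flat_map(|&b| [b & 0x0F, b >> 4]).collect()
}
```

Unpacking yields unsigned byte codes, which is exactly the form the W4A8 kernel consumes: the codes stay integer all the way into the multiply, so nothing is dequantized back to f32.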
RAI is organized as a Rust workspace with 4 crates:
```
rai/
├── rai-core/                Core types, embeddings, memory, reasoning
├── rai-infer/               Inference engine (the SIMD kernels live here)
│   ├── gemm.rs              W4A8 GEMM — AVX2 SIMD inner loops (2,000+ lines)
│   ├── layers.rs            RMSNorm, RoPE, GQA attention, SwiGLU MLP
│   ├── model.rs             Model loader (.raimodel format)
│   ├── kv_cache.rs          KV cache management
│   ├── sampler.rs           Temperature, top-k, top-p, min-p sampling (sketch below)
│   ├── format.rs            Binary model format parser
│   ├── speculative.rs       Standard speculative decoding
│   ├── self_speculative.rs  Self-speculative (layer-skipping) decoding
│   ├── ponder.rs            Adaptive compute / pondering
│   └── chat_template.rs     Chat template formatting
├── rai-compress/            Quantization & compression pipeline
│   ├── gptq.rs              GPTQ quantization algorithm
│   ├── compress.rs          Model compression orchestrator
│   ├── quantize.rs          Weight quantization utilities
│   ├── bitpack.rs           4-bit nibble packing
│   ├── hrc.rs               Hierarchical residual compression
│   ├── sac.rs               Structured adaptive compression
│   └── sparse.rs            Weight sparsification
└── rai-server/              API server with MCP support
```
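For orientation, here is a standalone sketch of the temperature, top-k, and top-p strategies listed for `sampler.rs` (min-p is omitted for brevity); the function name and signature are illustrative, not RAI's actual API:

```rust
/// Sample a token index from raw logits. `u` is a uniform random draw in
/// [0, 1) supplied by the caller, which keeps the sketch free of an RNG
/// dependency. Illustrative only; not the signature used in sampler.rs.
fn sample(logits: &[f32], temperature: f32, top_k: usize, top_p: f32, u: f32) -> usize {
    // 1. Temperature: divide logits to sharpen (<1) or flatten (>1) the distribution.
    let mut scored: Vec<(usize, f32)> = logits
        .iter()
        .enumerate()
        .map(|(i, &l)| (i, l / temperature.max(1e-6)))
        .collect();

    // 2. Top-k: keep only the k highest-scoring tokens.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.truncate(top_k.max(1));

    // Softmax over the surviving logits.
    let max = scored[0].1;
    let mut probs: Vec<f32> = scored.iter().map(|&(_, l)| (l - max).exp()).collect();
    let sum: f32 = probs.iter().sum();
    for p in &mut probs {
        *p /= sum;
    }

    // 3. Top-p (nucleus): cut the tail once cumulative probability reaches p.
    let mut cum = 0.0;
    let mut cutoff = probs.len();
    for (i, &p) in probs.iter().enumerate() {
        cum += p;
        if cum >= top_p {
            cutoff = i + 1;
            break;
        }
    }
    probs.truncate(cutoff);

    // Renormalize and draw from the truncated distribution.
    let sum: f32 = probs.iter().sum();
    let mut acc = 0.0;
    for (i, &p) in probs.iter().enumerate() {
        acc += p / sum;
        if u < acc {
            return scored[i].0;
        }
    }
    scored[cutoff - 1].0
}
```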
The GEMM kernel is the heart of RAI. Here's what makes it fast:
```
Input (f32) ───→ Per-group INT8 quantization ──→ Even/Odd split ──┐
                                                                  │
4-bit weights ─→ Nibble unpack ───────────────────────────────────┤
                                                                  ▼
                                                   PMADDUBSW (256-bit AVX2)
                                                   32 multiply-adds per clock
                                                                  │
                                                   PMADDWD (pair reduction)
                                                                  │
                                                   Float accumulate + scale
                                                                  │
                                                   Horizontal sum ──→ Output
```
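As a concrete reference for the first stage of the diagram, here is a minimal sketch of per-group symmetric INT8 activation quantization; the group size of 32 and the rounding policy are assumptions for illustration, not taken from gemm.rs:

```rust
/// Quantize one group of 32 f32 activations to symmetric INT8 with a single
/// per-group scale. Sketch only; group size and rounding policy are assumed.
fn quantize_group_i8(x: &[f32; 32]) -> ([i8; 32], f32) {
    // Choose the scale so the largest magnitude maps to ±127.
    let max_abs = x.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let mut q = [0i8; 32];
    for (qi, &v) in q.iter_mut().zip(x.iter()) {
        *qi = (v / scale).round().clamp(-127.0, 127.0) as i8;
    }
    (q, scale) // the scale is folded back in during the float-accumulate stage
}
```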
Key optimizations:
- Zero port-5 inner loop — Even/odd column split eliminates all shuffle instructions
- PMADDUBSW — 32 simultaneous multiply-adds per 256-bit instruction
- Hardware F16C — `_mm_cvtph_ps` for scale/zero conversion
- Fused projections — QKV and gate+up share input preprocessing via `rayon::join`
- Adaptive prefetching — 3-row lookahead for small matrices, 1-row for large
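To show the shape of that inner loop, here is a hedged sketch of one quantization group of the W4A8 dot product using the intrinsics named above (PMADDUBSW, then PMADDWD); the data layout, group size, and zero-point handling are assumptions, and the real gemm.rs is considerably more involved:

```rust
use core::arch::x86_64::*;

/// Dot product of one 32-element quantization group: unsigned 4-bit weight
/// codes (already unpacked to bytes) against signed INT8 activations.
/// Sketch only; layout, group size, and zero-point handling are assumed.
#[target_feature(enable = "avx2")]
unsafe fn dot_group_w4a8(
    w_nibbles: &[u8; 32], // weight codes 0..=15, one per byte
    x_i8: &[i8; 32],      // quantized activations for the same group
    w_scale: f32,         // per-group weight scale
    w_zero: f32,          // per-group weight zero point
    x_scale: f32,         // per-group activation scale
    x_sum: i32,           // precomputed sum of the 32 activations
) -> f32 {
    let w = _mm256_loadu_si256(w_nibbles.as_ptr() as *const __m256i);
    let x = _mm256_loadu_si256(x_i8.as_ptr() as *const __m256i);

    // PMADDUBSW: u8 * i8 products, adjacent pairs summed into 16 x i16.
    // 15 * 127 * 2 fits comfortably in i16, so saturation cannot occur.
    let p16 = _mm256_maddubs_epi16(w, x);
    // PMADDWD against ones: reduce adjacent i16 pairs into 8 x i32.
    let p32 = _mm256_madd_epi16(p16, _mm256_set1_epi16(1));

    // Horizontal sum of the 8 i32 lanes.
    let lo = _mm256_castsi256_si128(p32);
    let hi = _mm256_extracti128_si256::<1>(p32);
    let s = _mm_add_epi32(lo, hi);
    let s = _mm_hadd_epi32(s, s);
    let s = _mm_hadd_epi32(s, s);
    let idot = _mm_cvtsi128_si32(s) as f32;

    // Undo the weight zero point, then apply both per-group scales:
    //   sum_i (w_i - zero) * x_i * w_scale * x_scale
    (idot - w_zero * x_sum as f32) * w_scale * x_scale
}
```

In the real kernel, the even/odd column split is what lets both nibbles of each packed byte be consumed without any shuffle instructions, keeping port 5 idle; the sketch above sidesteps that by assuming the nibbles are already unpacked.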
Build, quantize a model, and run:

```bash
cargo build --release

# Quantize a HuggingFace model to .raimodel format
cargo run --release -p rai-compress --bin compress -- \
    --model path/to/model \
    --output model.raimodel \
    --bits 4

# Interactive chat
cargo run --release -p rai-infer --bin rai-chat -- \
    --model model.raimodel

# Text generation
cargo run --release -p rai-infer --bin rai-generate -- \
    --model model.raimodel \
    --prompt "The future of computing is"

# Start the API server
cargo run --release -p rai-server
```

| Requirement | Details |
|---|---|
| Rust | 1.70+ (edition 2021) |
| CPU | x86_64 with AVX2 + FMA + F16C (Intel Haswell+ / AMD Zen+) |
| RAM | Depends on model size (4-bit: ~4GB for 7B params) |
| OS | Linux, macOS, Windows (WSL2) |
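The RAM figure is straightforward arithmetic: 7B parameters at 4 bits is roughly 7 × 10⁹ × 0.5 bytes ≈ 3.5 GB of packed weights, with per-group scales and zeros, the KV cache, and activations accounting for the rest of the ~4 GB.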
| Binary | Description |
|---|---|
| `rai-generate` | Text generation CLI |
| `rai-chat` | Interactive chat interface |
| `profile-fwd` | Forward pass profiler (layer-by-layer timing) |
| `bw-bench` | Memory bandwidth benchmark |
MIT License — see LICENSE for details.
Built from scratch in Rust • Every SIMD instruction placed by hand • No GPU required