A Rust CLI tool for evaluating embedding models on code search tasks. Compares FastEmbed (ONNX/CPU) and MistralRS (Candle/Metal GPU) backends using file-level and answer-level retrieval metrics.
This benchmark was conducted to select an appropriate embedding model for a personal project building local RAG capabilities for AI coding agents. It is not designed to evaluate all embedding models comprehensively, but rather focuses on which models work best for real-world codebases and technical documentation in local development environments.
Key evaluation criteria:
- Retrieval quality for code snippets and technical documentation
- Local performance (memory usage and inference speed)
- Model size vs. accuracy tradeoffs for different hardware constraints
While not academically rigorous, the benchmark aims to provide objective metrics for practical decision-making when choosing embedding models for code search and semantic search features.
Note: Results focus on code/documentation retrieval scenarios. General-purpose embedding tasks may favor different models.
See the full benchmark report: docs/public_embedding_benchmark_report.md
TL;DR: EmbeddingGemma-300M Q8 achieved the best balance of quality (0.891 nDCG@10) and speed (33.5 emb/sec) on Apple Silicon.
- Rust 1.70+
- One of the supported hardware configurations (see below)
This benchmark supports two embedding backends with different hardware acceleration options:
FastEmbed uses ONNX Runtime via the ort crate:
| Platform | Execution Provider | Cargo Feature |
|---|---|---|
| NVIDIA GPU | CUDA | ort = { version = "2", features = ["cuda"] } |
| NVIDIA GPU | TensorRT | ort = { version = "2", features = ["tensorrt"] } |
| Apple Silicon | CoreML | ort = { version = "2", features = ["coreml"] } |
| Windows GPU | DirectML | ort = { version = "2", features = ["directml"] } |
| Generic CPU | Default | (no extra features needed) |
To enable GPU acceleration for FastEmbed, add ort features to Cargo.toml:
# Add ort with your desired execution provider
ort = { version = "2", features = ["cuda"] }  # For NVIDIA GPU
Then configure execution providers in code or use the default CPU fallback.
Mistral.rs uses Candle for inference:
| Platform | Backend | Device Config | Build Features |
|---|---|---|---|
| NVIDIA GPU | CUDA | device = "cuda" | cuda, flash-attn, cudnn |
| Apple Silicon | Metal | device = "metal" | metal |
| Intel CPU | MKL | device = "cpu" | mkl |
| Apple CPU | Accelerate | device = "cpu" | accelerate |
| Generic CPU | AVX/NEON | device = "cpu" | (default) |
The default build uses Metal (macOS). To build for other platforms, modify Cargo.toml:
# MistralRS for NVIDIA GPU (Linux/Windows)
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", features = ["cuda"] }
# MistralRS with Flash Attention (recommended for best performance)
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", features = ["cuda", "flash-attn", "cudnn"] }
# MistralRS for Intel CPU with MKL
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", features = ["mkl"] }
# MistralRS for Apple Silicon (default)
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git", features = ["metal"] }
# FastEmbed with CUDA (optional, for GPU-accelerated ONNX models)
ort = { version = "2", features = ["cuda"] }
Then configure your device in models.toml:
[[mistralrs]]
model = "embedding-gemma"
quantization = "q8"
device = "cuda" # or "metal", "cpu"
Test Environment: Our benchmark results were obtained on a MacBook Pro M1 Max (64GB RAM) using the Metal backend.
cargo build --release
Clone a codebase into the codebases/ directory:
git clone https://github.com/BurntSushi/ripgrep codebases/ripgrep
Create a query file (e.g., queries/ripgrep.json):
{
"metadata": {
"name": "ripgrep",
"description": "Queries for ripgrep codebase",
"version": "1.0"
},
"queries": [
{
"query": "regex matcher implementation",
"query_type": "Code",
"difficulty": "Medium",
"expected_files": ["crates/matcher/src/lib.rs"],
"description": "Find the regex matcher trait"
}
]
}
cargo run --release -- run \
--corpus codebases/ripgrep \
--queries queries/ripgrep.json \
--strategy line \
--output results/benchmark.json
# List available models
cargo run --release -- list
# Validate query file
cargo run --release -- validate-queries --queries queries/ripgrep.json
# Test a single model
cargo run --release -- test embedding-gemma --backend mistralrs --quantization q4k
Configure which models to benchmark:
# FastEmbed (ONNX/CPU) - works on all platforms
[[fastembed]]
model = "BAAI/bge-small-en-v1.5"
# MistralRS with GPU acceleration
[[mistralrs]]
model = "embedding-gemma"
quantization = "q8" # none, q8, q4k
device = "metal" # "metal" (macOS), "cuda" (NVIDIA), "cpu"
Available MistralRS models:
- `embedding-gemma` - Google's Embedding Gemma 300M (768 dims, ~0.6GB)
- `qwen3-0.6b` - Alibaba Qwen3 Embedding 0.6B (1024 dims, ~1.2GB)
- `qwen3-4b` - Alibaba Qwen3 Embedding 4B (2560 dims, ~8GB)
- `qwen3-8b` - Alibaba Qwen3 Embedding 8B (4096 dims, ~16GB)
- `--strategy line` - Fixed line-count chunks with overlap (baseline)
- `--strategy semantic` - Structure-aware: AST-based for code, header-based for docs
embedding-benchmark/
├── src/
│ ├── main.rs # CLI entry point
│ ├── embedders/ # Backend implementations
│ │ ├── fastembed_backend.rs # ONNX Runtime (CPU)
│ │ └── mistralrs_backend.rs # Candle + Metal (GPU)
│ ├── corpus/ # Chunking strategies
│ │ ├── chunker.rs # Line-based chunking
│ │ └── semantic_chunker.rs # AST-based chunking
│ ├── benchmark/ # Metrics (nDCG, MRR, Hit@K)
│ └── queries/ # Query file format
├── models.toml # Model configurations
├── queries/ # Query files (JSON)
├── codebases/ # Test codebases (gitignored)
└── docs/ # Benchmark reports
- Chunk the codebase - Split source files into smaller pieces using a chunking strategy
- Embed all chunks - Convert each chunk into a vector using an embedding model
- Embed queries - Convert search queries into vectors
- Retrieve & rank - Find chunks most similar to each query using cosine similarity
- Evaluate - Compare retrieved results against ground truth using IR metrics
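In essence, the retrieve & rank step is a brute-force nearest-neighbour search over the chunk vectors. A minimal Rust sketch of that step (illustrative only; `rank_chunks` is not the tool's actual API):

```rust
/// Cosine similarity between two embedding vectors of equal length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Score every chunk embedding against the query embedding and return the
/// top-k (chunk_index, similarity) pairs, best first.
fn rank_chunks(query: &[f32], chunk_embeddings: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = chunk_embeddings
        .iter()
        .enumerate()
        .map(|(i, emb)| (i, cosine_similarity(query, emb)))
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    scored.truncate(k);
    scored
}
```

At benchmark scale a brute-force scan is fast enough; a vector index would only matter for much larger corpora.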
We use standard Information Retrieval metrics to measure search quality:
nDCG (Normalized Discounted Cumulative Gain) measures ranking quality by considering both relevance and position. Higher-ranked relevant results contribute more to the score.
DCG@k = Σ (relevance_i / log2(i + 1)) for i = 1 to k
nDCG@k = DCG@k / IDCG@k (normalized by ideal ranking)
- Score range: 0.0 to 1.0 (higher is better)
- Why it matters: Penalizes relevant results appearing lower in the ranking
- Reference: Wikipedia - nDCG
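For reference, a minimal Rust implementation of the formula above, assuming a list of relevance scores in retrieved order (illustrative, not the benchmark's exact code):

```rust
/// DCG@k over relevance scores listed in retrieved order (positions are 1-based,
/// so the discount for the item at 0-based index i is log2(i + 2)).
fn dcg_at_k(relevances: &[f32], k: usize) -> f32 {
    relevances
        .iter()
        .take(k)
        .enumerate()
        .map(|(i, rel)| rel / (i as f32 + 2.0).log2())
        .sum()
}

/// nDCG@k = DCG@k of the actual ranking / DCG@k of the ideal (sorted) ranking.
fn ndcg_at_k(relevances: &[f32], k: usize) -> f32 {
    let mut ideal = relevances.to_vec();
    ideal.sort_by(|a, b| b.partial_cmp(a).unwrap_or(std::cmp::Ordering::Equal));
    let idcg = dcg_at_k(&ideal, k);
    if idcg == 0.0 { 0.0 } else { dcg_at_k(relevances, k) / idcg }
}
```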
MRR (Mean Reciprocal Rank) measures how quickly you find the first relevant result.
RR = 1 / rank_of_first_relevant_result
MRR = average RR across all queries
- Score range: 0.0 to 1.0 (higher is better)
- Example: If the first relevant result is at rank 3, RR = 1/3 ≈ 0.333
- Reference: Wikipedia - MRR
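A corresponding sketch, assuming each query has been reduced to the 1-based rank of its first relevant result (`None` if nothing relevant was retrieved); names are illustrative:

```rust
/// Mean Reciprocal Rank over all queries. Each entry is the 1-based rank of the
/// first relevant result for that query, or None if nothing relevant was retrieved.
fn mean_reciprocal_rank(first_relevant_ranks: &[Option<usize>]) -> f64 {
    if first_relevant_ranks.is_empty() {
        return 0.0;
    }
    let sum: f64 = first_relevant_ranks
        .iter()
        .map(|r| match r {
            Some(rank) => 1.0 / *rank as f64, // RR = 1 / rank
            None => 0.0,                      // query contributes nothing
        })
        .sum();
    sum / first_relevant_ranks.len() as f64
}
```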
Hit@K is a binary metric: did a relevant result appear in the top K results?
- Hit@1: Was the top result relevant? (precision-focused)
- Hit@5: Was there a relevant result in top 5? (recall-focused)
- Score range: 0% to 100%
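And Hit@K over the same per-query ranks, reported as a percentage (again only a sketch):

```rust
/// Hit@K as a percentage: how many queries had their first relevant result
/// within the top k positions?
fn hit_at_k(first_relevant_ranks: &[Option<usize>], k: usize) -> f64 {
    if first_relevant_ranks.is_empty() {
        return 0.0;
    }
    let hits = first_relevant_ranks
        .iter()
        .flatten()                    // drop queries with no relevant result
        .filter(|&&rank| rank <= k)
        .count();
    100.0 * hits as f64 / first_relevant_ranks.len() as f64
}
```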
The chunking strategy determines how source files are split into embeddable pieces.
Simple approach: split files by line count with overlap.
Parameters:
- Code files: 50 lines per chunk
- Doc files: 100 lines per chunk
- Overlap: 25%
Example output:
[file: src/config.rs]
use std::path::Path;
use serde::{Deserialize, Serialize};
pub struct Config {
pub model: String,
pub max_tokens: usize,
}
// ... (up to 50 lines)
Pros: Simple, predictable, language-agnostic
Cons: May split functions mid-body, loses semantic context
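A minimal sketch of the line-based strategy, using a hypothetical `chunk_lines` helper (the real chunker also tracks file paths and line offsets):

```rust
/// Split file contents into fixed-size line chunks with fractional overlap
/// (e.g. 50 lines per chunk and 25% overlap for code files).
fn chunk_lines(contents: &str, lines_per_chunk: usize, overlap: f32) -> Vec<String> {
    let lines: Vec<&str> = contents.lines().collect();
    // Advance by the non-overlapping portion of each chunk, at least one line.
    let step = ((lines_per_chunk as f32) * (1.0 - overlap)).max(1.0) as usize;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < lines.len() {
        let end = (start + lines_per_chunk).min(lines.len());
        chunks.push(lines[start..end].join("\n"));
        if end == lines.len() {
            break;
        }
        start += step;
    }
    chunks
}
```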
AST-based chunking that respects code structure and injects context.
For code files:
- Parses source into AST using tree-sitter
- Chunks at semantic boundaries (functions, classes, impl blocks)
- Injects parent context (class name, module docs) into each chunk
Example output:
[file: src/auth/middleware.rs] Authentication middleware
[context: impl AuthMiddleware > fn verify_token]
fn verify_token(&self, token: &str) -> Result<Claims> {
let decoded = jsonwebtoken::decode(token, &self.key)?;
Ok(decoded.claims)
}
For documentation (Markdown):
- Splits at header boundaries (##, ###)
- Injects header breadcrumb trail for context
Example output:
[file: docs/api.md]
[section: # API Reference > ## Authentication > ### JWT Tokens]
JWT tokens are issued by the `/auth/login` endpoint...
Pros: Preserves semantic units, self-contained chunks with context
Cons: Requires language-specific parsers, more complex
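The documentation side can be sketched without any parser dependency. The following simplified splitter handles ATX-style headers only and ignores code fences; it is illustrative, not the tool's semantic_chunker.rs:

```rust
/// Split Markdown into header-delimited chunks, each prefixed with its
/// file path and header breadcrumb trail.
fn chunk_markdown(path: &str, markdown: &str) -> Vec<String> {
    fn flush(path: &str, trail: &[String], body: &mut String, chunks: &mut Vec<String>) {
        if !body.trim().is_empty() {
            chunks.push(format!(
                "[file: {path}]\n[section: {}]\n{}",
                trail.join(" > "),
                body.trim()
            ));
        }
        body.clear();
    }

    let mut trail: Vec<String> = Vec::new();  // breadcrumb header lines, e.g. "## Authentication"
    let mut levels: Vec<usize> = Vec::new();  // header depth for each breadcrumb entry
    let mut body = String::new();
    let mut chunks = Vec::new();

    for line in markdown.lines() {
        if line.starts_with('#') {
            let level = line.chars().take_while(|&c| c == '#').count();
            // Emit the section collected so far, then update the breadcrumb trail.
            flush(path, &trail, &mut body, &mut chunks);
            while levels.last().is_some_and(|&l| l >= level) {
                levels.pop();
                trail.pop();
            }
            levels.push(level);
            trail.push(line.trim().to_string());
        } else {
            body.push_str(line);
            body.push('\n');
        }
    }
    flush(path, &trail, &mut body, &mut chunks);
    chunks
}
```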
We evaluate at two granularity levels:
"Did we find the right file?"
- Aggregates chunk scores to file scores (max score per file)
- Compares against expected files from ground truth
- Metrics: nDCG@10, MRR, Recall@10, Precision@1
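The max-score aggregation is a small fold over the ranked chunks; a sketch with illustrative names:

```rust
use std::collections::HashMap;

/// Collapse ranked (file_path, chunk_score) pairs into per-file scores,
/// keeping the maximum chunk score per file, then rank files best-first.
fn aggregate_to_files(ranked_chunks: &[(String, f32)]) -> Vec<(String, f32)> {
    let mut best: HashMap<&str, f32> = HashMap::new();
    for (path, score) in ranked_chunks {
        let entry = best.entry(path.as_str()).or_insert(f32::MIN);
        if *score > *entry {
            *entry = *score;
        }
    }
    let mut files: Vec<(String, f32)> = best
        .into_iter()
        .map(|(p, s)| (p.to_string(), s))
        .collect();
    files.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    files
}
```

Taking the maximum rather than the mean keeps a file competitive even when only one of its chunks matches the query.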
"Did we find the exact code location?"
- Evaluates whether retrieved chunks contain the actual answer
- Uses keyword matching and line range overlap
- Metrics: Hit@1, Hit@3, Hit@5, MRR@10
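Both checks are simple; a sketch of the line-range overlap and keyword tests (illustrative only; how the evaluator combines the two signals is omitted here):

```rust
use std::ops::RangeInclusive;

/// Does a retrieved chunk's line span overlap the ground-truth answer span?
fn lines_overlap(chunk: &RangeInclusive<usize>, answer: &RangeInclusive<usize>) -> bool {
    chunk.start() <= answer.end() && answer.start() <= chunk.end()
}

/// Crude keyword check: does the chunk text contain every expected keyword?
fn contains_keywords(chunk_text: &str, keywords: &[&str]) -> bool {
    let haystack = chunk_text.to_lowercase();
    keywords.iter().all(|k| haystack.contains(&k.to_lowercase()))
}
```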
Why both? File-level shows overall navigation accuracy. Answer-level shows whether the embedding captured the specific code semantics.
| Level | Metric | Description |
|---|---|---|
| File | nDCG@10 | Ranking quality of relevant files |
| File | MRR | Reciprocal rank of first relevant file |
| Answer | Hit@K | Found answer chunk in top-K results |
| Answer | MRR@10 | Reciprocal rank of answer chunk |
For detailed metric definitions, see Methodology above.
- Hardware-specific results: Benchmark results were obtained on Apple Silicon M1 Max with Metal backend. Performance (throughput, memory usage) will differ on NVIDIA GPUs with CUDA or CPU-only configurations. Quality metrics (nDCG, MRR) should be consistent across platforms.
- Model availability: Models are downloaded from Hugging Face on first run. Ensure sufficient disk space (~1-16GB depending on model) and network access.
- Not production-ready: This is a benchmarking tool for research purposes. The embedding backends and chunking strategies are not optimized for production use.
- Query bias: Benchmark quality depends heavily on query design. Results may not generalize to different query styles or codebases.
- Quantization trade-offs: While Q8/Q4K quantization showed minimal quality loss in our tests, this may vary for other use cases.
- nDCG - Normalized Discounted Cumulative Gain - Standard IR ranking metric
- MRR - Mean Reciprocal Rank - First relevant result metric
- BEIR Benchmark - Heterogeneous benchmark for information retrieval
- MTEB Leaderboard - Massive Text Embedding Benchmark
- FastEmbed - ONNX-based embedding library
- Mistral.rs - Rust inference engine with Metal support
- Tree-sitter - Incremental parsing library for code
MIT