Motivation
CCE's compression layer (chunk summarization for retrieval) currently runs through a local Ollama install with phi3:mini (3.8B params). That setup works, but it has two real pain points:
Setup friction — users without Ollama installed silently fall back to truncation-only compression, missing one of the bigger savings layers.
Quality ceiling — phi3:mini sometimes paraphrases incorrectly, drops error-handling branches, or hallucinates type signatures, which degrades downstream retrieval relevance.
Adding OpenRouter as an alternative backend lets users with an API key skip the Ollama install entirely and pick a stronger model (Claude Haiku 4.5, GPT-4o-mini, Llama-3.1-70B, etc.) without CCE having to maintain a separate client for each provider.
Scope
What's in
New src/context_engine/compression/openrouter_client.py mirroring the OllamaClient interface (same summarize(prompt, model) -> str shape).
Extract a minimal LLMClient protocol so Compressor can hold either backend without conditional logic at every call site (see the sketch after this list).
Config additions in .context-engine.yaml / ~/.cce/config.yaml:
```yaml
compression:
  provider: openrouter                    # ollama (default) | openrouter
  model: anthropic/claude-haiku-4-5
  api_key: ${OPENROUTER_API_KEY}          # env var preferred
  base_url: https://openrouter.ai/api/v1  # override for proxies
```
Env var OPENROUTER_API_KEY overrides the config field (matches the CCE_OLLAMA_URL pattern).
cce status reports the active provider, model, and (for OpenRouter) whether the API key is set.
Tests with a stubbed HTTP client mirroring the Ollama test pattern.
Docs: new "Compression backends" section in docs/wiki/Configuration.md covering setup, model picks, and cost-per-reindex.
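For concreteness, here's a minimal sketch of what the protocol extraction and client could look like. Everything beyond the summarize(prompt, model) -> str shape — the httpx dependency, the constructor parameters, and the resolve_api_key helper — is an assumption, not settled design; the only thing OpenRouter guarantees is an OpenAI-compatible chat completions endpoint.

```python
# Sketch only: httpx, the constructor shape, and resolve_api_key are
# assumptions — the real client should mirror whatever OllamaClient does.
import os
from typing import Protocol

import httpx


class LLMClient(Protocol):
    """Minimal surface Compressor needs from any backend."""

    def summarize(self, prompt: str, model: str) -> str: ...


def resolve_api_key(config_value: str | None) -> str | None:
    # env var > config > default, matching the CCE_OLLAMA_URL pattern
    return os.environ.get("OPENROUTER_API_KEY") or config_value


class OpenRouterClient:
    def __init__(self, api_key: str, base_url: str = "https://openrouter.ai/api/v1") -> None:
        self._api_key = api_key
        self._base_url = base_url.rstrip("/")

    def summarize(self, prompt: str, model: str) -> str:
        # OpenRouter exposes an OpenAI-compatible chat completions endpoint.
        resp = httpx.post(
            f"{self._base_url}/chat/completions",
            headers={"Authorization": f"Bearer {self._api_key}"},
            json={"model": model, "messages": [{"role": "user", "content": prompt}]},
            timeout=60.0,
        )
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]
```

Because both clients satisfy LLMClient structurally, Compressor never needs isinstance checks or provider branches — it just holds whichever client config selected.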
What's out
Embeddings via OpenRouter — OpenRouter routes chat completions, not embeddings. Different feature, different providers (Voyage AI, OpenAI text-embedding-3, Cohere). Track separately.
Output compression — that's a different layer (Claude's responses, not CCE's chunks); unaffected.
Retrieval reranking via LLM — possible follow-up, but out of scope here.
Honest tradeoffs
Compression ratio: unchanged (~90%) — that's mostly truncation + structured summarization, not the model's reasoning power.
Quality (relevance fidelity): estimated 5–15% better recall on harder queries with Haiku/GPT-4o-mini vs. phi3:mini. That estimate is a guess until we run the recall benchmark below, and it should not be cited as a number until measured.
Latency: slower than local Ollama. phi3 local: ~50–100ms/chunk on M-series. OpenRouter Haiku: ~150–400ms/chunk including network. A 10k-chunk first index goes from ~10 min to ~20–60 min.
Cost (one-time index of ~10k chunks, ~5M input tokens):
Haiku 4.5 via OpenRouter: ~$5
GPT-4o-mini via OpenRouter: ~$0.75
Incremental reindexes (per-commit): cents
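The back-of-envelope behind those figures, so they're easy to re-check (the per-token prices and per-chunk latencies are assumptions, not measurements — verify against current OpenRouter rates):

```python
# Back-of-envelope only; prices and latencies are assumptions used to
# sanity-check the figures above, not measured numbers.
CHUNKS = 10_000
INPUT_TOKENS = 5_000_000  # ~500 input tokens per chunk

for name, price_per_m_tokens in [("haiku-4.5", 1.00), ("gpt-4o-mini", 0.15)]:
    print(f"{name}: ${INPUT_TOKENS / 1e6 * price_per_m_tokens:.2f}")  # ~$5.00 / ~$0.75

for name, sec_per_chunk in [("local phi3", 0.075), ("openrouter", 0.275)]:
    # Sequential worst case; request concurrency would shrink wall-clock time.
    print(f"{name}: ~{CHUNKS * sec_per_chunk / 60:.0f} min")  # ~13 / ~46 min
```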
The real win is adoption — users with an existing API key can skip the Ollama install. The quality bump is genuine but modest.
Pre-implementation: recall benchmark
Before merging, run a small A/B/C comparison on a fixed corpus (suggest fastapi, already a benchmark target):
Bucket A: phi3:mini compression
Bucket B: Haiku 4.5 via OpenRouter
Bucket C: GPT-4o-mini via OpenRouter
Same query set as scripts/bench_recall.py
Report MRR, top-5 recall, and average compression latency per chunk (metric sketch below)
This puts real numbers in the wiki page so users can make an informed pick.
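For reference, the two quality metrics reduce to a few lines each. This is a sketch of the standard definitions; scripts/bench_recall.py is the real harness and its signatures may differ.

```python
def mrr(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant chunk per query."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, chunk_id in enumerate(ranked, start=1):
            if chunk_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)


def recall_at_5(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Fraction of queries with at least one relevant chunk in the top 5."""
    hits = sum(1 for ranked, rel in zip(results, relevant) if rel & set(ranked[:5]))
    return hits / len(results)
```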
Test plan
OpenRouterClient.summarize() returns the model's text response, gracefully handles 4xx (bad key, model not found) with a clear error
OpenRouterClient retries transient 5xx / network errors with backoff (mirror Ollama client behavior)
Compressor round-trips a chunk through either backend based on compression.provider
A missing OPENROUTER_API_KEY falls back to truncation-only with a one-time log warning, not a hard error (matches Ollama-not-running behavior)
cce status shows the active provider correctly for both backends
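One possible shape for the first two of these, assuming the httpx-based client sketched above (the import path and helper names are assumptions; the existing Ollama client tests define the real pattern to mirror):

```python
import httpx
import pytest

# Assumed import path — adjust to wherever the module actually lands.
from context_engine.compression.openrouter_client import OpenRouterClient


def _response(status: int, payload: dict) -> httpx.Response:
    # Attach a request so raise_for_status() can build its error message.
    return httpx.Response(status, json=payload, request=httpx.Request("POST", "https://test"))


def test_summarize_returns_model_text(monkeypatch):
    ok = _response(200, {"choices": [{"message": {"content": "summary text"}}]})
    monkeypatch.setattr(httpx, "post", lambda *args, **kwargs: ok)
    client = OpenRouterClient(api_key="sk-test")
    assert client.summarize("chunk", model="anthropic/claude-haiku-4-5") == "summary text"


def test_bad_key_raises_clear_error(monkeypatch):
    denied = _response(401, {"error": {"message": "invalid key"}})
    monkeypatch.setattr(httpx, "post", lambda *args, **kwargs: denied)
    client = OpenRouterClient(api_key="sk-bad")
    with pytest.raises(httpx.HTTPStatusError):
        client.summarize("chunk", model="anthropic/claude-haiku-4-5")
```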
Related
src/context_engine/compression/ollama_client.py, src/context_engine/compression/compressor.py
860cc1e: same pattern of "env var > config > default"
48bd407: the framework that should produce the recall numbers