Add OpenRouter as a compression backend (Claude Haiku, GPT-4o-mini, etc.) #50

Motivation

CCE's compression layer (chunk summarization for retrieval) currently runs through a local Ollama install with phi3:mini (3.8B params). That setup works, but it has two real pain points:

  1. Setup friction — users without Ollama installed silently fall back to truncation-only compression, missing one of the bigger savings layers.
  2. Quality ceiling — phi3:mini sometimes paraphrases incorrectly, drops error-handling branches, or hallucinates type signatures, which degrades downstream retrieval relevance.

Adding OpenRouter as an alternative backend lets users with an API key skip the Ollama install entirely and pick a stronger model (Claude Haiku 4.5, GPT-4o-mini, Llama-3.1-70B, etc.) without CCE maintaining a separate client per provider.

Scope

What's in

  • New src/context_engine/compression/openrouter_client.py mirroring the OllamaClient interface (same summarize(prompt, model) -> str shape).
  • Extract a minimal LLMClient protocol so Compressor can hold either backend without conditional logic at every call site (see the sketch after this list).
  • Config additions in .context-engine.yaml / ~/.cce/config.yaml:
    compression:
      provider: openrouter            # ollama (default) | openrouter
      model: anthropic/claude-haiku-4-5
      api_key: ${OPENROUTER_API_KEY}  # env var preferred
      base_url: https://openrouter.ai/api/v1  # override for proxies
  • Env var OPENROUTER_API_KEY overrides the config field (matches the CCE_OLLAMA_URL pattern).
  • cce status reports the active provider, model, and (for OpenRouter) whether the API key is set.
  • Tests with a stubbed HTTP client mirroring the Ollama test pattern.
  • Docs: new "Compression backends" section in docs/wiki/Configuration.md covering setup, model picks, and cost-per-reindex.
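
A minimal sketch of how the protocol and client could fit together, assuming the summarize(prompt, model) -> str shape above and that the package imports as context_engine per the src/ layout. The retry policy, header set, and make_client() helper are illustrative, not a final design; only the OllamaClient module path comes from the existing code:

  import os
  import time
  from typing import Protocol

  import requests

  from context_engine.compression.ollama_client import OllamaClient  # existing backend


  class LLMClient(Protocol):
      """Anything exposing the OllamaClient summarize() shape."""
      def summarize(self, prompt: str, model: str) -> str: ...


  class OpenRouterClient:
      def __init__(self, api_key: str | None = None,
                   base_url: str = "https://openrouter.ai/api/v1"):
          # env var > config field, matching the CCE_OLLAMA_URL precedent
          self.api_key = os.environ.get("OPENROUTER_API_KEY") or api_key
          self.base_url = base_url.rstrip("/")

      def summarize(self, prompt: str, model: str) -> str:
          # OpenRouter exposes an OpenAI-compatible chat completions endpoint
          payload = {"model": model,
                     "messages": [{"role": "user", "content": prompt}]}
          for attempt in range(3):
              resp = requests.post(
                  f"{self.base_url}/chat/completions",
                  headers={"Authorization": f"Bearer {self.api_key}"},
                  json=payload,
                  timeout=60,
              )
              if resp.status_code < 500:
                  resp.raise_for_status()  # 4xx (bad key, unknown model): no retry
                  return resp.json()["choices"][0]["message"]["content"]
              time.sleep(2 ** attempt)  # transient 5xx: back off and retry
          resp.raise_for_status()  # out of retries: surface the last 5xx
          raise RuntimeError("unreachable")


  def make_client(cfg: dict) -> LLMClient:
      # Dispatch on compression.provider once, so Compressor never branches per call
      if cfg.get("provider", "ollama") == "openrouter":
          return OpenRouterClient(api_key=cfg.get("api_key"),
                                  base_url=cfg.get("base_url",
                                                   "https://openrouter.ai/api/v1"))
      return OllamaClient()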

What's out

  • Embeddings via OpenRouter — OpenRouter routes chat completions, not embeddings. Different feature, different providers (Voyage AI, OpenAI text-embedding-3, Cohere). Track separately.
  • Output compression — that's a different layer (Claude's responses, not CCE's chunks); unaffected.
  • Retrieval reranking via LLM — possible follow-up, but out of scope here.

Honest tradeoffs

Compression ratio: unchanged (~90%) — the ratio comes mostly from truncation + structured summarization, not from the model's reasoning power.

Quality (relevance fidelity): estimated 5–15% better recall on harder queries with Haiku/GPT-4o-mini vs. phi3:mini. The estimate is a guess until the recall benchmark below has run; don't cite it as a measured number.

Latency: slower than local Ollama. phi3 local: ~50–100ms/chunk on M-series. OpenRouter Haiku: ~150–400ms/chunk including network. A 10k-chunk first index therefore goes from ~10 min to roughly 25–65 min if chunks are compressed serially (10,000 × 150–400ms); concurrent requests would shrink the wall-clock time.

Cost (one-time index of ~10k chunks, ~5M input tokens):

  • Haiku 4.5 via OpenRouter: ~$5
  • GPT-4o-mini via OpenRouter: ~$0.75
  • Incremental reindexes (per-commit): cents
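
The arithmetic behind those figures (pricing as of writing; check OpenRouter's model pages before relying on it): ~5M input tokens × roughly $1/M for Haiku 4.5 ≈ $5, and × roughly $0.15/M for GPT-4o-mini ≈ $0.75. Output tokens add a little on top, since summaries are short.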

The real win is adoption — users with an existing API key can skip the Ollama install. The quality bump is genuine but modest.

Pre-implementation: recall benchmark

Before merging, run a small three-bucket comparison on a fixed corpus (fastapi is suggested, since it's already a benchmark target):

  • Bucket A: phi3:mini compression
  • Bucket B: Haiku 4.5 via OpenRouter
  • Bucket C: GPT-4o-mini via OpenRouter
  • Same query set as scripts/bench_recall.py
  • Report MRR, top-5 recall, and average compression latency per chunk

This puts real numbers in the wiki page so users can make an informed pick.
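
A sketch of the metric math the harness needs. The result and gold-label shapes below are simplifications (one relevant chunk per query); scripts/bench_recall.py's existing query set and retrieval hooks should be reused rather than reinvented:

  # `results` holds, per query, the ranked chunk IDs a bucket's index returned;
  # `relevant` holds the gold chunk ID per query.

  def mrr(results: list[list[str]], relevant: list[str]) -> float:
      """Mean reciprocal rank of the gold chunk across all queries."""
      total = 0.0
      for ranked, gold in zip(results, relevant):
          for rank, chunk_id in enumerate(ranked, start=1):
              if chunk_id == gold:
                  total += 1.0 / rank
                  break
      return total / len(results)

  def recall_at_5(results: list[list[str]], relevant: list[str]) -> float:
      """Fraction of queries whose gold chunk lands in the top 5."""
      hits = sum(gold in ranked[:5] for ranked, gold in zip(results, relevant))
      return hits / len(results)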

Test plan

  • OpenRouterClient.summarize() returns the model's text response and gracefully handles 4xx (bad key, model not found) with a clear error (a test sketch follows this list)
  • OpenRouterClient retries transient 5xx / network errors with backoff (mirror Ollama client behavior)
  • Compressor round-trips a chunk through either backend based on compression.provider
  • Missing OPENROUTER_API_KEY falls back to truncation-only with a one-time log warning, not a hard error (matches Ollama-not-running behavior)
  • cce status shows the active provider correctly for both backends
  • Recall benchmark numbers committed alongside the code change, not after
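
A pytest sketch of the 4xx case, assuming the client shape sketched under "What's in" above. The _FakeResponse stub and the import path are hypothetical; the real tests should mirror the existing Ollama stub pattern:

  import pytest
  import requests

  from context_engine.compression.openrouter_client import OpenRouterClient  # proposed path


  class _FakeResponse:
      """Stands in for requests.Response with a fixed status code."""
      def __init__(self, status_code: int, payload: dict | None = None):
          self.status_code = status_code
          self._payload = payload or {}

      def raise_for_status(self):
          if self.status_code >= 400:
              raise requests.HTTPError(f"HTTP {self.status_code}")

      def json(self):
          return self._payload


  def test_bad_key_raises_clear_error(monkeypatch):
      # Every POST comes back 401, as it would with a bad or missing API key
      monkeypatch.setattr(requests, "post", lambda *a, **kw: _FakeResponse(401))
      client = OpenRouterClient(api_key="bad-key")
      with pytest.raises(requests.HTTPError):
          client.summarize("def f(): ...", model="anthropic/claude-haiku-4-5")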

Related

  • Existing Ollama abstraction: src/context_engine/compression/ollama_client.py, src/context_engine/compression/compressor.py
  • Configurable Ollama URL precedent (issue #22 "Possibility to configure external Ollama server", commit 860cc1e): same pattern of "env var > config > default"
  • 7-layer benchmark (commit 48bd407): the framework that should produce the recall numbers
