Your LLM didn't crash — it fell back and kept going.
Quota errors should be invisible.
As LLM APIs get rate-limited and expensive, local fallback isn't optional anymore.
→ Claude fails → continues on Ollama
→ Simple prompts → never hit the cloud
→ Every response → shows tokens saved
Trooper is a circuit breaker + router + context engine for LLMs.
Every response tells you exactly what happened — no dashboards, no setup:
# Simple question → Ollama handled it, cloud never contacted
X-Trooper-Provider: ollama
X-Trooper-Decision: ollama (simple turn) | cloud skipped
X-Trooper-Session-Saved: 42 tokens
# Complex question → Claude handled it
X-Trooper-Provider: claude
X-Trooper-Summary: claude (direct) ✓
# Claude quota hit → fell back to Ollama, context preserved
X-Trooper-Provider: ollama
X-Trooper-Decision: ollama (fallback: credit_balance)
X-Trooper-Session-Saved: 42 tokens
X-Trooper-Summary: claude → ollama (credit_balance) | context ✓
X-Trooper-Session-Saved accumulates across the session: every turn routed locally instead of to a paid API adds to the count.
Trooper is a drop-in proxy for LLM apps. When cloud models fail — quota, rate limits, outages — it automatically falls back to your local Ollama instance while preserving full conversation context.
No retries. No crashes. No lost sessions. ⏱ Runs in under 60 seconds.
App developers — your users never see quota errors. Trooper fails over to local Ollama transparently while your app keeps running.
Agent builders — agent loops survive quota limits mid-task. Context is preserved so the agent continues exactly where it left off.
Claude Code / Cursor users — coding sessions survive quota hits. No lost context, no starting over.
Privacy-conscious developers — use x_force_local to keep
sensitive requests off the cloud without interrupting the session.
LiteLLM and Bifrost route between cloud providers.
Trooper is built for a different failure mode: when the cloud stops working.
| | LiteLLM / Bifrost | Trooper |
|---|---|---|
| Fallback target | Another cloud provider | Your local machine |
| Setup | pip install, venv, YAML | One Go binary, env vars |
| Dependencies | Heavy Python stack | Zero (pure stdlib) |
| Works offline | ❌ | ✅ |
| Data on fallback | Goes to another cloud | Stays on your machine |
When LiteLLM falls back, your data goes to another cloud. When Trooper falls back, your data goes to your machine.
Trooper decides when the cloud is overkill.
The classifier is rule-based and deterministic — no LLM call, no latency, no cost to classify. Most routing tools call an LLM to decide routing. Trooper doesn't.
Simple, stateless requests route directly to your local Ollama — no API call, no cost:
"how many days in a week" → Ollama directly 🪖 (cloud never contacted)
"explain why goroutines…" → Claude ✅ (needs reasoning)
Routes to Ollama: factual lookups, definitions, formatting, conversation meta, short stateless summaries
Always goes to Claude: reasoning, judgment, multi-step tasks, context-aware summaries, code, messages over 20 words
The hard part of fallback isn't switching models — it's keeping context.
Trooper solves that with a 3-layer compaction system:
ANCHOR (~10%) — First 2 turns verbatim, never dropped
SITREP (~20%) — Rule-based summary of middle turns
TAIL (~70%) — Last N turns verbatim
Total <= 6144 tokens (configurable)
The SITREP is extracted automatically — no LLM call needed. From a real session:
[TROOPER_SITREP]{
"intent": "building a go proxy called trooper that falls back to local",
"stage": "in_progress",
"constraints": ["local-first", "proxy-layer"],
"active_entities": ["Trooper", "Ollama", "Claude"],
"open_loops": ["streaming pending"],
"recent_actions": ["deploy monday", "check streaming"],
"resolved_loops": ["resolve the health check"],
"confidence": 1.00
}[/TROOPER_SITREP]
Compaction triggers automatically when the session exceeds the token budget:
📦 Context compaction triggered — 1532 tokens exceeds 6144 budget
Anchor turns : 2 (~180 tokens)
Middle turns : 2 → SITREP (~148 tokens)
Recent turns : 1 (~36 tokens)
Tokens used : 364 / 6144
Honest note: Compaction is lossy by design. The SITREP preserves intent and state, not verbatim history. For precision-critical workflows, keep sessions short or increase CONTEXT_WINDOW.
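To make the three layers concrete, here is a rough sketch of how a session could be split into anchor, middle, and tail turns. It is an illustration under stated assumptions, not Trooper's implementation: the `Turn` and `Layers` types, the `split` function, and the four-characters-per-token estimate are all hypothetical, and the rule-based SITREP extraction itself is not shown.

```go
package main

import "fmt"

// Turn is one message in a session.
type Turn struct {
	Role    string
	Content string
}

// Layers holds the three parts of a compacted session.
type Layers struct {
	Anchor []Turn // kept verbatim
	Middle []Turn // would be condensed into a SITREP
	Tail   []Turn // kept verbatim
}

// estimateTokens is a crude stand-in (about 4 characters per token).
func estimateTokens(turns []Turn) int {
	total := 0
	for _, t := range turns {
		total += (len(t.Content) + 3) / 4
	}
	return total
}

// split keeps the first anchorN and last tailN turns and puts the rest in Middle.
func split(turns []Turn, anchorN, tailN int) Layers {
	if len(turns) <= anchorN {
		return Layers{Anchor: turns}
	}
	rest := turns[anchorN:]
	if len(rest) <= tailN {
		return Layers{Anchor: turns[:anchorN], Tail: rest}
	}
	cut := len(rest) - tailN
	return Layers{Anchor: turns[:anchorN], Middle: rest[:cut], Tail: rest[cut:]}
}

func main() {
	turns := make([]Turn, 5)
	for i := range turns {
		turns[i] = Turn{Role: "user", Content: fmt.Sprintf("turn %d", i+1)}
	}
	budget := 6144 // default CONTEXT_WINDOW
	if estimateTokens(turns) <= budget {
		fmt.Println("under budget, no compaction needed")
	}
	l := split(turns, 2, 1)
	fmt.Println(len(l.Anchor), len(l.Middle), len(l.Tail))
}
```

With five turns, two anchor turns, and a one-turn tail, the split matches the shape of the log above: two anchor turns, two middle turns, one recent turn.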
ollama pull qwen2.5:3b
💡 Eliminate cold-start latency by setting OLLAMA_KEEP_ALIVE=24h. Without it, the first fallback after idle takes 3–5s for 7B models and up to 20s for 72B. Add this line to your Ollama systemd service:
Environment="OLLAMA_KEEP_ALIVE=24h"
Docker:
git clone https://github.com/shouvik12/trooper
cd trooper
cp .env.example .env
# edit .env and set CLAUDE_API_KEY
docker compose up

Go:
git clone https://github.com/shouvik12/trooper
cd trooper
export CLAUDE_API_KEY=sk-ant-...
go run main.go providers.go classifier.go

Trooper starts on http://127.0.0.1:3000. It binds to localhost by default, so your API keys are not exposed on the network.
Point your existing client at Trooper — nothing else changes:
Python + Anthropic SDK:
import anthropic
client = anthropic.Anthropic(
api_key="your-key",
base_url="http://localhost:3000", # only change
)
Python + OpenAI SDK:
from openai import OpenAI
client = OpenAI(
api_key="your-key",
base_url="http://localhost:3000", # only change
)
curl:
curl http://localhost:3000/ \
-H "Content-Type: application/json" \
-H "X-Session-ID: my-session" \
-d '{"model": "claude-3-5-haiku-20241022", "messages": [{"role": "user", "content": "Hello!"}]}'
Pass X-Session-ID to track named sessions. Without it, Trooper assigns a unique auto session per request.
Trooper builds the chain from environment variables. Ollama is always last.
CLAUDE_API_KEY=sk-ant-... # Chain: Claude → Ollama
CLAUDE_API_KEY=sk-ant-... GEMINI_API_KEY=AIza... # Chain: Claude → Gemini → Ollama
CLAUDE_API_KEY=sk-ant-... OPENAI_API_KEY=sk-... # Chain: Claude → OpenAI → Ollama

| Status | Trooper action |
|---|---|
| 200 OK | Pass through |
| 429 Rate Limited | Retry with 2s backoff, then try next |
| 402 Payment Required | Fall back immediately |
| 400 Credit Balance | Detect credit error, fall back immediately |
| 401 Unauthorized | Surface error; bad keys are never masked |
| 529 Overloaded | Fall back immediately |
| Network error | Fall back immediately; 30s timeout per provider |
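The table above reads naturally as a single decision function. The following is a sketch of that mapping, not Trooper's code: the `Action` type, the `decide` function, and the "credit balance" substring check are assumptions for illustration. How Trooper treats statuses that the table does not list (including 400 responses that are not credit errors) is not stated here, so the sketch surfaces them and says so in the comments.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// Action is what the proxy does with an upstream response.
type Action string

const (
	PassThrough   Action = "pass_through"
	RetryThenNext Action = "retry_then_next"
	FallBack      Action = "fall_back"
	SurfaceError  Action = "surface_error"
)

// decide maps an upstream status code and body to an action.
func decide(status int, body string) Action {
	switch status {
	case http.StatusOK:
		return PassThrough
	case http.StatusTooManyRequests: // 429: retry with backoff, then next provider
		return RetryThenNext
	case http.StatusPaymentRequired, 529: // 402 and 529: fall back immediately
		return FallBack
	case http.StatusBadRequest:
		// Assumed detection: look for a credit-balance message in the body.
		if strings.Contains(strings.ToLower(body), "credit balance") {
			return FallBack
		}
		return SurfaceError // assumption: other 400s are real client errors
	case http.StatusUnauthorized: // 401: bad keys are never masked
		return SurfaceError
	default:
		return SurfaceError // assumption: unlisted statuses are surfaced
	}
}

func main() {
	fmt.Println(decide(429, ""))
	fmt.Println(decide(400, `{"error":"Your credit balance is too low"}`))
	fmt.Println(decide(401, ""))
}
```

Keeping 401 out of the fallback path matters: if a bad key silently fell back to Ollama, the misconfiguration would never be noticed.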
curl http://localhost:3000/ ... -v 2>&1 | grep X-Trooper
# Simple turn — cloud never contacted
X-Trooper-Provider: ollama
X-Trooper-Decision: ollama (simple turn) | cloud skipped
X-Trooper-Session-Saved: 14 tokens
# Cloud served normally
X-Trooper-Provider: claude
X-Trooper-Fallback-Count: 0
X-Trooper-Summary: claude (direct) ✓
# Quota hit — fell back, context preserved
X-Trooper-Provider: ollama
X-Trooper-Fallback-Count: 1
X-Trooper-Decision: ollama (fallback: credit_balance)
X-Trooper-Session-Saved: 14 tokens
X-Trooper-Summary: claude → ollama (credit_balance) | context ✓
If a provider fails 3 times within 60 seconds, Trooper skips it automatically, so no round trips are wasted. The circuit resets after 60 seconds.
⚡ Skipping claude — circuit open (3 fails in last 60s)
🔄 Trying provider: ollama
AUTO_RECOVERY=true go run main.go providers.go classifier.go
Health checks use a free GET /models endpoint, so they make no inference requests and cost nothing. Trooper silently routes back to the primary provider when it recovers.
Add x_force_local: true to any request body to route that specific
request to Ollama, regardless of complexity or provider availability.
Use for:
- Privacy — keep sensitive requests off the cloud
- Cost control — force local for expensive operations
- Offline mode — bypass cloud entirely mid-session
The session context is preserved. Cloud routing resumes on the next request without the flag.
Example:
# Turn 1 & 2 — Claude handles it (cloud)
curl http://localhost:3000/v1/chat/completions \
-H "X-Session-ID: dev-session" \
-d '{"model": "claude-sonnet-4-5", "max_tokens": 1024,
"messages": [{"role": "user", "content": "Help me design our auth layer"}]}'
# Turn 3 — sensitive detail, developer keeps it local
curl http://localhost:3000/v1/chat/completions \
-H "X-Session-ID: dev-session" \
-d '{"model": "claude-sonnet-4-5", "max_tokens": 1024,
"x_force_local": true,
"messages": [{"role": "user", "content": "Our payment vault uses..."}]}'
Trooper log on Turn 3:
🔒 Developer requested local-only (x_force_local) — skipping cloud
🔒 Local: ollama (force_local) | privacy mode | session saved: 28 tokens
go test ./... -v
Covers: turn classifier, code detection, context compaction, token estimation. All tests must pass before any contribution is merged.
| Variable | Default | Description |
|---|---|---|
| CLAUDE_API_KEY | (none) | Anthropic API key |
| CLAUDE_MODEL | (none) | Default Claude model |
| GEMINI_API_KEY | (none) | Google Gemini API key |
| GEMINI_MODEL | gemini-2.0-flash | Default Gemini model |
| OPENAI_API_KEY | (none) | OpenAI API key |
| OPENAI_MODEL | gpt-4o-mini | Default OpenAI model |
| OLLAMA_MODEL | qwen2.5:3b | Local fallback model |
| FALLBACK_URL | http://localhost:11434/api/chat | Ollama endpoint |
| CONTEXT_WINDOW | 6144 | Token budget for context compaction |
| QUOTA_STATUS_CODES | 429,402,529,400 | HTTP codes that trigger fallback |
| TROOPER_PORT | 3000 | Port Trooper listens on |
| TROOPER_BIND | 127.0.0.1 | Bind address |
| AUTO_RECOVERY | false | Enable automatic recovery to primary provider |
| OLLAMA_KEEP_ALIVE | 5m | Set 24h in systemd to eliminate cold-start latency |
| Model | Size | Notes |
|---|---|---|
| qwen2.5:3b | 1.9GB | Default: fast, lightweight |
| qwen2.5:7b | 4.7GB | Better quality, still fast |
| llama3.1:8b | 4.9GB | Strong all-rounder |
| mistral:7b | 4.1GB | Good reasoning |
V3.1 — Released
- ✅ Smart routing — simple turns route to Ollama directly, cloud never contacted
- ✅ X-Trooper-Session-Saved header — cumulative tokens saved per session
- ✅ X-Trooper-Decision header — routing decision on every response
- ✅ Deterministic classifier — no LLM call to route, zero added latency
V3.0 — Released
- ✅ Circuit breaker — skip providers that fail 3x in 60s
- ✅ Zero-interruption log lines
- ✅ X-Trooper-Summary header
V2 / V2.2 — Released
- ✅ Cloud → Ollama fallback with session continuity
- ✅ Context compaction — Anchor + SITREP + Tail
- ✅ Streaming, health check, auto recovery, zero dependencies
- Featured in Agent Brief by agentcommunity.org — curated alongside Anthropic, Shopify MCP, and LangGraph updates (April 2026)
- Featured on @github_unpacked — Instagram reel with 76 saves
- Featured on PatentLLM — covered alongside Qwen3.6-27B RTX 3090 local inference story (May 2026)
- Featured on dev.to — local AI tooling roundup (May 2026)
- Cited by kylebrodeur as inspiration for "robust, transparent HTTP rate-limit fallback triggers"
MIT