
shouvik12/trooper


NEW: 4-agent privacy routing demo →

🪖 Trooper

Your LLM didn't crash — it fell back and kept going.

Quota errors should be invisible.

As LLM APIs get rate-limited and expensive, local fallback isn't optional anymore.

→ Claude fails       → continues on Ollama
→ Simple prompts     → never hit the cloud
→ Every response     → shows tokens saved

Trooper is a circuit breaker + router + context engine for LLMs.


What you see

Every response tells you exactly what happened — no dashboards, no setup:

# Simple question → Ollama handled it, cloud never contacted
X-Trooper-Provider: ollama
X-Trooper-Decision: ollama (simple turn) | cloud skipped
X-Trooper-Session-Saved: 42 tokens

# Complex question → Claude handled it
X-Trooper-Provider: claude
X-Trooper-Summary: claude (direct) ✓

# Claude quota hit → fell back to Ollama, context preserved
X-Trooper-Provider: ollama
X-Trooper-Decision: ollama (fallback: credit_balance)
X-Trooper-Session-Saved: 42 tokens
X-Trooper-Summary: claude → ollama (credit_balance) | context ✓

X-Trooper-Session-Saved accumulates across the session — every turn routed locally instead of to a paid API adds to the count.
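The accumulation can be pictured as a per-session counter. The Go sketch below makes that concrete. SessionSavings, Add and Header are hypothetical names, not Trooper's internals. It assumes the caller decides which turns count as saved and passes their token estimate in.

```go
package main

import (
	"fmt"
	"sync"
)

// SessionSavings is a hypothetical per-session counter illustrating how a
// cumulative X-Trooper-Session-Saved value could be tracked.
type SessionSavings struct {
	mu    sync.Mutex
	saved map[string]int // session ID -> tokens kept off paid APIs
}

func NewSessionSavings() *SessionSavings {
	return &SessionSavings{saved: make(map[string]int)}
}

// Add credits tokens served locally instead of by a paid API and returns the
// session's running total.
func (s *SessionSavings) Add(sessionID string, tokens int) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.saved[sessionID] += tokens
	return s.saved[sessionID]
}

// Header formats the running total the way the examples above display it.
func (s *SessionSavings) Header(sessionID string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return fmt.Sprintf("%d tokens", s.saved[sessionID])
}

func main() {
	s := NewSessionSavings()
	s.Add("my-session", 14) // first locally served turn
	s.Add("my-session", 28) // second locally served turn
	fmt.Println("X-Trooper-Session-Saved:", s.Header("my-session")) // X-Trooper-Session-Saved: 42 tokens
}
```

The mutex matters because a proxy serves concurrent requests, sometimes for the same session.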


What Trooper is

Trooper is a drop-in proxy for LLM apps. When cloud models fail — quota, rate limits, outages — it automatically falls back to your local Ollama instance while preserving full conversation context.

No client-side retries. No crashes. No lost sessions. ⏱ Up and running in under 60 seconds.


Who uses Trooper

App developers — your users never see quota errors. Trooper fails over to local Ollama transparently while your app keeps running.

Agent builders — agent loops survive quota limits mid-task. Context is preserved so the agent continues exactly where it left off.

Claude Code / Cursor users — coding sessions survive quota hits. No lost context, no starting over.

Privacy-conscious developers — use x_force_local to keep sensitive requests off the cloud without interrupting the session.


Why not LiteLLM or Bifrost

LiteLLM and Bifrost route between cloud providers.

Trooper is built for a different failure mode: when the cloud stops working.

| | LiteLLM / Bifrost | Trooper |
|---|---|---|
| Fallback target | Another cloud provider | Your local machine |
| Setup | pip install, venv, YAML | One Go binary, env vars |
| Dependencies | Heavy Python stack | Zero (pure stdlib) |
| Works offline | ❌ | ✅ |
| Data on fallback | Goes to another cloud | Stays on your machine |

When LiteLLM falls back, your data goes to another cloud. When Trooper falls back, your data goes to your machine.


Smart routing

Trooper decides when the cloud is overkill.

The classifier is rule-based and deterministic: no LLM call, no extra network round trip, no cost to classify. Many routing tools call an LLM to make the routing decision. Trooper doesn't.

Simple, stateless requests route directly to your local Ollama — no API call, no cost:

"how many days in a week"  →  Ollama directly 🪖  (cloud never contacted)
"explain why goroutines…"  →  Claude ✅           (needs reasoning)

Routes to Ollama: factual lookups, definitions, formatting, conversation meta, short stateless summaries

Always goes to Claude: reasoning, judgment, multi-step tasks, context-aware summaries, code, messages over 20 words
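Rules like these fit in a few lines of deterministic code. The Go sketch below is an illustration only. The keyword lists, looksLikeCode and ClassifyTurn are hypothetical stand-ins, not the contents of Trooper's classifier.go. It assumes one user message is classified at a time, and it keeps the documented 20-word cut-off.

```go
package main

import (
	"fmt"
	"strings"
)

// Route is where a turn is sent.
type Route string

const (
	RouteLocal Route = "ollama"
	RouteCloud Route = "cloud"
)

// Hypothetical keyword lists, for illustration only.
var (
	// Reasoning, judgment, multi-step work, or references to earlier context.
	cloudSignals = []string{"why", "explain", "design", "compare", "should i",
		"step by step", "refactor", "debug", "we discussed", "earlier", "above"}
	// Factual lookups, definitions, formatting, short stateless summaries.
	localPrefixes = []string{"what is", "what does", "define", "how many",
		"format", "convert", "summarize this"}
)

// looksLikeCode is a crude code detector based on common syntax markers.
func looksLikeCode(s string) bool {
	for _, marker := range []string{"func ", "def ", "class ", "#include", "=>", "};"} {
		if strings.Contains(s, marker) {
			return true
		}
	}
	return false
}

// ClassifyTurn routes a single user message without calling any model.
func ClassifyTurn(msg string) Route {
	text := strings.ToLower(strings.TrimSpace(msg))
	if len(strings.Fields(text)) > 20 || looksLikeCode(msg) {
		return RouteCloud // long messages and code always go to the cloud
	}
	for _, sig := range cloudSignals {
		if strings.Contains(text, sig) {
			return RouteCloud
		}
	}
	for _, p := range localPrefixes {
		if strings.HasPrefix(text, p) {
			return RouteLocal
		}
	}
	return RouteCloud // unrecognised: prefer the stronger model
}

func main() {
	for _, q := range []string{"how many days in a week", "explain why goroutines leak"} {
		fmt.Printf("%-32q -> %s\n", q, ClassifyTurn(q))
	}
}
```

Unrecognised messages default to the cloud. A wrong local answer costs more than an unnecessary cloud call.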


How Trooper handles context

The hard part of fallback isn't switching models — it's keeping context.

Trooper solves that with a 3-layer compaction system:

ANCHOR  (~10%)  — First 2 turns verbatim, never dropped
SITREP  (~20%)  — Rule-based summary of middle turns
TAIL    (~70%)  — Last N turns verbatim
                  Total <= 6144 tokens (configurable)

The SITREP is extracted automatically — no LLM call needed. From a real session:

[TROOPER_SITREP]{
  "intent": "building a go proxy called trooper that falls back to local",
  "stage": "in_progress",
  "constraints": ["local-first", "proxy-layer"],
  "active_entities": ["Trooper", "Ollama", "Claude"],
  "open_loops": ["streaming pending"],
  "recent_actions": ["deploy monday", "check streaming"],
  "resolved_loops": ["resolve the health check"],
  "confidence": 1.00
}[/TROOPER_SITREP]
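The SITREP payload above is plain JSON between two markers, so it maps naturally onto a Go struct. In the sketch below, the field names come from the sample. The Go types, Sitrep, ParseSitrep and Render are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Sitrep mirrors the fields in the sample SITREP; the types are assumed.
type Sitrep struct {
	Intent         string   `json:"intent"`
	Stage          string   `json:"stage"`
	Constraints    []string `json:"constraints"`
	ActiveEntities []string `json:"active_entities"`
	OpenLoops      []string `json:"open_loops"`
	RecentActions  []string `json:"recent_actions"`
	ResolvedLoops  []string `json:"resolved_loops"`
	Confidence     float64  `json:"confidence"`
}

const (
	openTag  = "[TROOPER_SITREP]"
	closeTag = "[/TROOPER_SITREP]"
)

// ParseSitrep extracts and decodes the JSON between the SITREP markers.
func ParseSitrep(text string) (Sitrep, error) {
	var s Sitrep
	start := strings.Index(text, openTag)
	end := strings.Index(text, closeTag)
	if start < 0 || end < start {
		return s, fmt.Errorf("no SITREP block found")
	}
	err := json.Unmarshal([]byte(text[start+len(openTag):end]), &s)
	return s, err
}

// Render wraps a SITREP in its markers, ready to be placed in the prompt.
func Render(s Sitrep) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return openTag + string(b) + closeTag, nil
}

func main() {
	msg := `[TROOPER_SITREP]{"intent": "building a go proxy", "stage": "in_progress",
"open_loops": ["streaming pending"], "confidence": 1.00}[/TROOPER_SITREP]`
	s, err := ParseSitrep(msg)
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Stage, s.OpenLoops)
}
```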

Compaction triggers automatically when the session exceeds the token budget:

📦  Context compaction triggered — 1532 tokens exceeds 6144 budget
    Anchor turns   : 2 (~180 tokens)
    Middle turns   : 2 → SITREP (~148 tokens)
    Recent turns   : 1 (~36 tokens)
    Tokens used    : 364 / 6144
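The three layers and the budget trigger can be sketched in Go as follows. Several pieces are assumptions. The chars/4 token estimate is a common heuristic, not Trooper's estimator. The summarize callback stands in for the rule-based SITREP extractor. Shrinking the tail until the result fits is one reasonable reading of "last N turns".

```go
package main

import (
	"fmt"
	"strings"
)

// Turn is one chat message.
type Turn struct {
	Role    string
	Content string
}

// estimateTokens uses a rough 4-characters-per-token heuristic (an assumption).
func estimateTokens(t Turn) int { return len(t.Content)/4 + 1 }

func total(turns []Turn) int {
	n := 0
	for _, t := range turns {
		n += estimateTokens(t)
	}
	return n
}

// Compact returns turns unchanged while they fit in budget. Otherwise it keeps
// the first anchorN turns (ANCHOR), replaces the middle with one SITREP turn,
// and keeps as many of the last tailN turns (TAIL) as still fit.
func Compact(turns []Turn, budget, anchorN, tailN int, summarize func([]Turn) string) []Turn {
	if total(turns) <= budget || len(turns) < anchorN+2 {
		return turns // fits, or nothing lies between anchor and a 1-turn tail
	}
	if tailN < 1 {
		tailN = 1
	}
	if maxTail := len(turns) - anchorN - 1; tailN > maxTail {
		tailN = maxTail
	}
	var out []Turn
	for n := tailN; n >= 1; n-- { // shrink the tail until the result fits
		sitrep := Turn{Role: "user", // role for the SITREP turn is an assumption
			Content: "[TROOPER_SITREP]" + summarize(turns[anchorN:len(turns)-n]) + "[/TROOPER_SITREP]"}
		out = append(append(append([]Turn{}, turns[:anchorN]...), sitrep), turns[len(turns)-n:]...)
		if total(out) <= budget {
			break
		}
	}
	return out // may still exceed budget if anchor + SITREP alone do
}

func main() {
	var turns []Turn
	for i := 0; i < 10; i++ {
		turns = append(turns, Turn{Role: "user", Content: strings.Repeat("x", 400)})
	}
	summarize := func(mid []Turn) string { return fmt.Sprintf(`{"summarized_turns": %d}`, len(mid)) }
	out := Compact(turns, 600, 2, 4, summarize)
	fmt.Printf("%d turns, ~%d tokens (was ~%d)\n", len(out), total(out), total(turns))
}
```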

Honest note: Compaction is lossy by design. The SITREP preserves intent and state — not verbatim history. For precision-critical workflows, keep sessions short or increase CONTEXT_WINDOW.


Quickstart

⏱ Runs in under 60 seconds.

Prerequisites

Install Ollama, then pull the default fallback model:

ollama pull qwen2.5:3b

💡 Eliminate cold-start latency: by default Ollama unloads a model after 5 minutes idle, so the first fallback after an idle period has to reload it. That takes 3–5s for 7B models and up to 20s for 72B. Keep the model loaded by setting OLLAMA_KEEP_ALIVE=24h in the [Service] section of your Ollama systemd unit:

[Service]
Environment="OLLAMA_KEEP_ALIVE=24h"

Option 1 — Docker (no Go required)

git clone https://github.com/shouvik12/trooper
cd trooper
cp .env.example .env
# edit .env — set CLAUDE_API_KEY
docker compose up

Option 2 — Run from source (Go 1.22+)

git clone https://github.com/shouvik12/trooper
cd trooper
export CLAUDE_API_KEY=sk-ant-...
go run main.go providers.go classifier.go

Trooper starts on http://127.0.0.1:3000. Binds to localhost by default — your API keys are not exposed on the network.


Usage

Point your existing client at Trooper — nothing else changes:

Python + Anthropic SDK:

import anthropic
client = anthropic.Anthropic(
    api_key="your-key",
    base_url="http://localhost:3000",  # only change
)

Python + OpenAI SDK:

from openai import OpenAI
client = OpenAI(
    api_key="your-key",
    base_url="http://localhost:3000",  # only change
)

curl:

curl http://localhost:3000/ \
  -H "Content-Type: application/json" \
  -H "X-Session-ID: my-session" \
  -d '{"model": "claude-3-5-haiku-20241022", "messages": [{"role": "user", "content": "Hello!"}]}'

Pass X-Session-ID to track named sessions. Without it, Trooper assigns a unique auto session per request.
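A Go client needs no SDK. Any net/http request aimed at Trooper works. The sketch below posts the same request as the curl example, tags it with a named session, and prints the X-Trooper-* routing headers. newTrooperRequest is a hypothetical helper. Adding max_tokens to the body is an assumption borrowed from the later examples.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// newTrooperRequest builds a chat request for a Trooper instance at baseURL,
// tagged with a named session so context carries across turns.
func newTrooperRequest(baseURL, sessionID, model, prompt string) (*http.Request, error) {
	payload := map[string]any{
		"model":      model,
		"max_tokens": 1024,
		"messages":   []map[string]string{{"role": "user", "content": prompt}},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-Session-ID", sessionID)
	return req, nil
}

func main() {
	req, err := newTrooperRequest("http://localhost:3000", "my-session", "claude-3-5-haiku-20241022", "Hello!")
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("is Trooper running?", err)
		return
	}
	defer resp.Body.Close()
	// Trooper reports its routing decision in response headers.
	fmt.Println("provider:", resp.Header.Get("X-Trooper-Provider"))
	fmt.Println("summary: ", resp.Header.Get("X-Trooper-Summary"))
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```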


Provider chain

Trooper builds the chain from environment variables. Ollama is always last.

CLAUDE_API_KEY=sk-ant-...                          # Chain: Claude → Ollama
CLAUDE_API_KEY=sk-ant-...  GEMINI_API_KEY=AIza...  # Chain: Claude → Gemini → Ollama
CLAUDE_API_KEY=sk-ant-...  OPENAI_API_KEY=sk-...   # Chain: Claude → OpenAI → Ollama
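The rule is simple. Each cloud provider joins the chain when its key is set, and Ollama is always appended last. The Go sketch below expresses it. buildChain is hypothetical. The order used when both Gemini and OpenAI keys are set is an assumption, since the examples above only show one secondary provider at a time.

```go
package main

import (
	"fmt"
	"os"
)

// buildChain reconstructs the documented rule. The getenv parameter makes it
// testable without touching the real environment.
func buildChain(getenv func(string) string) []string {
	var chain []string
	for _, p := range []struct{ name, keyVar string }{
		{"claude", "CLAUDE_API_KEY"},
		{"gemini", "GEMINI_API_KEY"}, // Gemini before OpenAI is an assumption
		{"openai", "OPENAI_API_KEY"},
	} {
		if getenv(p.keyVar) != "" {
			chain = append(chain, p.name)
		}
	}
	return append(chain, "ollama") // local fallback is always last
}

func main() {
	fmt.Println(buildChain(os.Getenv))
}
```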

Fallback behaviour

| Status | Trooper action |
|---|---|
| 200 OK | Pass through |
| 429 Rate Limited | Retry with 2s backoff, then try the next provider |
| 402 Payment Required | Fall back immediately |
| 400 (credit balance error) | Detect the credit error, fall back immediately |
| 401 Unauthorized | Surface the error (bad keys are never masked) |
| 529 Overloaded | Fall back immediately |
| Network error | Fall back immediately (30s timeout per provider) |
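The table reduces to a pure function of one provider attempt. The Go sketch below encodes it. Action, decide and errTimeout are hypothetical names. Three behaviours are assumptions rather than documented behaviour. A 400 counts as a credit error when its body mentions "credit balance", which is the wording Anthropic uses in that error. Any unlisted status is surfaced. A non-nil netErr stands for a network error or timeout.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Action is what happens after one provider attempt.
type Action string

const (
	PassThrough   Action = "pass through"
	RetryThenNext Action = "retry after 2s, then next provider"
	FallBack      Action = "fall back immediately"
	Surface       Action = "surface error"
)

var errTimeout = errors.New("provider timeout") // stand-in network error

// decide maps a status code, response body and transport error to an Action.
func decide(status int, body string, netErr error) Action {
	if netErr != nil {
		return FallBack
	}
	switch {
	case status == 200:
		return PassThrough
	case status == 429:
		return RetryThenNext
	case status == 402, status == 529:
		return FallBack
	case status == 400 && strings.Contains(strings.ToLower(body), "credit balance"):
		return FallBack
	default:
		// 401 and anything unlisted: surface it rather than mask it.
		return Surface
	}
}

func main() {
	fmt.Println(decide(429, "", nil))
	fmt.Println(decide(400, `{"error":{"message":"Your credit balance is too low to access the Anthropic API."}}`, nil))
	fmt.Println(decide(0, "", errTimeout))
}
```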

Response headers

curl http://localhost:3000/ ... -v 2>&1 | grep X-Trooper

# Simple turn — cloud never contacted
X-Trooper-Provider: ollama
X-Trooper-Decision: ollama (simple turn) | cloud skipped
X-Trooper-Session-Saved: 14 tokens

# Cloud served normally
X-Trooper-Provider: claude
X-Trooper-Fallback-Count: 0
X-Trooper-Summary: claude (direct) ✓

# Quota hit — fell back, context preserved
X-Trooper-Provider: ollama
X-Trooper-Fallback-Count: 1
X-Trooper-Decision: ollama (fallback: credit_balance)
X-Trooper-Session-Saved: 14 tokens
X-Trooper-Summary: claude → ollama (credit_balance) | context ✓

Circuit breaker

If a provider fails 3 times within 60 seconds, Trooper skips it automatically — no wasted round trips. Resets after 60 seconds.

⚡ Skipping claude — circuit open (3 fails in last 60s)
🔄 Trying provider: ollama

Auto recovery

AUTO_RECOVERY=true go run main.go providers.go classifier.go

Health checks use a free GET /models endpoint — no inference requests, no cost. Trooper silently routes back to the primary provider when it recovers.


Per-request local routing

Add x_force_local: true to any request body to route that specific request to Ollama, regardless of complexity or provider availability.

Use for:

  • Privacy — keep sensitive requests off the cloud
  • Cost control — force local for expensive operations
  • Offline mode — bypass cloud entirely mid-session

The session context is preserved. Cloud routing resumes on the next request without the flag.
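Detecting the flag means decoding the request body just far enough to read one field. The Go sketch below does that. splitForceLocal is hypothetical. It also deletes x_force_local before the body would be forwarded, so upstream APIs never see an unknown parameter. That stripping step is an assumption about good practice, not documented Trooper behaviour.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// splitForceLocal reports whether the request asked for local-only handling
// and returns the body with x_force_local removed. Other fields are kept as
// raw JSON, so their contents pass through unchanged.
func splitForceLocal(body []byte) (bool, []byte, error) {
	var req map[string]json.RawMessage
	if err := json.Unmarshal(body, &req); err != nil {
		return false, nil, err
	}
	force := false
	if raw, ok := req["x_force_local"]; ok {
		if err := json.Unmarshal(raw, &force); err != nil {
			return false, nil, fmt.Errorf("x_force_local must be a boolean: %w", err)
		}
		delete(req, "x_force_local")
	}
	cleaned, err := json.Marshal(req)
	return force, cleaned, err
}

func main() {
	force, body, err := splitForceLocal([]byte(`{"model":"claude-sonnet-4-5","x_force_local":true,"messages":[]}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(force, string(body))
}
```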

Example:

# Turn 1 & 2: Claude handles it (cloud)
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Session-ID: dev-session" \
  -d '{"model": "claude-sonnet-4-5", "max_tokens": 1024,
       "messages": [{"role": "user", "content": "Help me design our auth layer"}]}'

# Turn 3: sensitive detail, developer keeps it local
curl http://localhost:3000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Session-ID: dev-session" \
  -d '{"model": "claude-sonnet-4-5", "max_tokens": 1024,
       "x_force_local": true,
       "messages": [{"role": "user", "content": "Our payment vault uses..."}]}'

Trooper log on Turn 3:

🔒 Developer requested local-only (x_force_local) — skipping cloud
🔒 Local: ollama (force_local) | privacy mode | session saved: 28 tokens

Running tests

go test ./... -v

Covers: turn classifier, code detection, context compaction, token estimation. All tests must pass before any contribution is merged.


Configuration

| Variable | Default | Description |
|---|---|---|
| CLAUDE_API_KEY | | Anthropic API key |
| CLAUDE_MODEL | | Default Claude model |
| GEMINI_API_KEY | | Google Gemini API key |
| GEMINI_MODEL | gemini-2.0-flash | Default Gemini model |
| OPENAI_API_KEY | | OpenAI API key |
| OPENAI_MODEL | gpt-4o-mini | Default OpenAI model |
| OLLAMA_MODEL | qwen2.5:3b | Local fallback model |
| FALLBACK_URL | http://localhost:11434/api/chat | Ollama endpoint |
| CONTEXT_WINDOW | 6144 | Token budget for context compaction |
| QUOTA_STATUS_CODES | 429,402,529,400 | HTTP status codes that trigger fallback |
| TROOPER_PORT | 3000 | Port Trooper listens on |
| TROOPER_BIND | 127.0.0.1 | Bind address |
| AUTO_RECOVERY | false | Enable automatic recovery to the primary provider |
| OLLAMA_KEEP_ALIVE | 5m | Ollama server setting; set to 24h in the Ollama systemd unit to eliminate cold-start latency |

Recommended local models

| Model | Size | Notes |
|---|---|---|
| qwen2.5:3b | 1.9GB | Default: fast, lightweight |
| qwen2.5:7b | 4.7GB | Better quality, still fast |
| llama3.1:8b | 4.9GB | Strong all-rounder |
| mistral:7b | 4.1GB | Good reasoning |

Roadmap

V3.1 — Released

  • ✅ Smart routing — simple turns route to Ollama directly, cloud never contacted
  • ✅ X-Trooper-Session-Saved header — cumulative tokens saved per session
  • ✅ X-Trooper-Decision header — routing decision on every response
  • ✅ Deterministic classifier — no LLM call to route, zero added latency

V3.0 — Released

  • ✅ Circuit breaker — skip providers that fail 3x in 60s
  • ✅ Zero-interruption log lines
  • ✅ X-Trooper-Summary header

V2 / V2.2 — Released

  • ✅ Cloud → Ollama fallback with session continuity
  • ✅ Context compaction — Anchor + SITREP + Tail
  • ✅ Streaming, health check, auto recovery, zero dependencies

Recognition

  • Featured in Agent Brief by agentcommunity.org — curated alongside Anthropic, Shopify MCP, and LangGraph updates (April 2026)
  • Featured on @github_unpacked — Instagram reel with 76 saves
  • Featured on PatentLLM — covered alongside Qwen3.6-27B RTX 3090 local inference story (May 2026)
  • Featured on dev.to — local AI tooling roundup (May 2026)
  • Cited by kylebrodeur as inspiration for "robust, transparent HTTP rate-limit fallback triggers"

License

MIT

About

A drop-in proxy that falls back to local Ollama when any LLM quota runs out
