Tool-calling reliability runtime for LLMs. Parses, repairs, and retries malformed tool-call output so you don't have to. Lifts baseline success from 25% to 90% on a 60-case benchmark with 30 mutation modes.
Models without native tool support produce unreliable output — XML one turn, JSON the next, hallucinated tool names, missing args, type mismatches. On our benchmark, baseline success is 25%. Most workarounds are regex hacks or single-pass prompts. They break when the format drifts and give you no way to see what went wrong.
StagePilot provides three composable pieces:
| Layer | What it does | Use independently? |
|---|---|---|
| `@ai-sdk-tool/parser` | AI SDK middleware — format normalization, schema coercion, repair | ✅ `pnpm add @ai-sdk-tool/parser` |
| StagePilot Runtime | 5-stage multi-agent pipeline with pass/fail gates and telemetry | ✅ Full API server |
| BenchLab | BFCL experiment tooling for prompt-mode tool calling | ✅ Standalone experiments |
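To make the parser layer's "schema coercion" concrete: models frequently emit `"3"` where a tool expects `3`, or `"true"` where it expects `true`. Below is a minimal, dependency-free sketch of the idea — not the actual `schema-coerce` engine, and the `coerceArgs` helper and its spec format are invented for illustration (the real middleware derives target types from the tool's Zod schema).

```typescript
// Sketch: nudge loosely-typed argument values toward the declared types.
// Handles only string <-> number and string <-> boolean mismatches.
type Spec = Record<string, "string" | "number" | "boolean">;

function coerceArgs(
  args: Record<string, unknown>,
  spec: Spec
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, want] of Object.entries(spec)) {
    const val = args[key];
    if (
      want === "number" &&
      typeof val === "string" &&
      val.trim() !== "" &&
      !Number.isNaN(Number(val))
    ) {
      out[key] = Number(val); // "3" -> 3
    } else if (want === "boolean" && (val === "true" || val === "false")) {
      out[key] = val === "true"; // "true" -> true
    } else if (want === "string" && typeof val === "number") {
      out[key] = String(val); // 42 -> "42"
    } else {
      out[key] = val; // already correct, or not safely coercible
    }
  }
  return out;
}
```

The real engine also has to decide when *not* to coerce — e.g. `"007"` may be an ID string, not the number 7 — which is why coercion is schema-driven rather than guess-driven.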
```mermaid
flowchart TB
    subgraph Client["Client Application"]
        App[Your App / Agent]
    end
    subgraph Parser["@ai-sdk-tool/parser — npm package"]
        MW[Middleware Layer]
        MW --> Proto{Protocol Detection}
        Proto --> Hermes[Hermes JSON]
        Proto --> MorphXML[MorphXML]
        Proto --> YamlXML[YamlXML]
        Proto --> Qwen[Qwen3Coder]
        Hermes & MorphXML & YamlXML & Qwen --> RJSON[RJSON Parser]
        Hermes & MorphXML & YamlXML & Qwen --> RXML[RXML Parser]
        RJSON & RXML --> Coerce[Schema Coercion]
        Coerce --> Repair[Repair + Retry Loop]
    end
    subgraph Pipeline["StagePilot Runtime — 5-Stage Pipeline"]
        direction LR
        E[Eligibility] --> S[Safety]
        S --> P[Planner]
        P --> O[Outreach]
        O --> J[Judge]
    end
    subgraph Observe["Observability"]
        OTel[OpenTelemetry Spans]
        Prom[Prometheus Metrics]
        DD[Datadog Dashboards]
    end
    subgraph Deploy["Deployment"]
        Docker[Docker]
        CR[GCP Cloud Run]
        K8s[Kubernetes + HPA]
        CF[Cloudflare Workers]
        Vercel[Vercel]
    end
    subgraph IaC["Infrastructure as Code"]
        TF[Terraform]
        Manifests[K8s Manifests]
    end
    App --> MW
    Repair --> Pipeline
    Pipeline --> Observe
    Pipeline --> Deploy
    Deploy --> IaC
    style Parser fill:#1a1a2e,stroke:#e94560,color:#fff
    style Pipeline fill:#16213e,stroke:#0f3460,color:#fff
    style Observe fill:#0f3460,stroke:#533483,color:#fff
```
```mermaid
sequenceDiagram
    participant C as Client
    participant MW as Parser Middleware
    participant E as EligibilityAgent
    participant S as SafetyAgent
    participant P as PlannerAgent
    participant O as OutreachAgent
    participant J as JudgeAgent
    participant T as Telemetry
    C->>MW: Raw model text
    MW->>MW: Protocol detect → Parse → Coerce → Repair
    MW-->>T: parse_span (protocol, latency, status)
    alt Parse failed + retry enabled
        MW->>MW: RALPH retry loop (max 2 attempts)
    end
    MW->>E: Normalized tool call
    E->>E: Scope check + program matching
    E-->>T: eligibility_span
    alt Not eligible
        E-->>C: Early rejection
    end
    E->>S: Eligible intake
    S->>S: Policy enforcement (DUI, duplicates, etc.)
    S-->>T: safety_span
    alt Safety blocked
        S-->>C: Block + reason
    end
    S->>P: Safe intake
    P->>P: Generate action plan + fallback route
    P-->>T: planner_span
    P->>O: Action plan
    O->>O: Generate outreach messages per agency
    O-->>T: outreach_span
    O->>J: Execution results
    J->>J: Quality score (0-100) + review
    J-->>T: judge_span
    alt Score < threshold
        J->>E: Trigger replay
    end
    J-->>C: Final result + audit trail
```
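The parse → repair → retry path shown above can be approximated with a bounded loop that parses, applies one more layer of cheap textual repair on each failure, and retries. This is an illustrative sketch in the spirit of the "RALPH retry loop" — the helper names and the specific repair heuristics here are invented, not the parser package's actual implementation.

```typescript
// Each repair pass fixes one common failure pattern: markdown fences,
// trailing commas, or garbage tokens after the final closing brace.
const repairs: Array<(s: string) => string> = [
  (s) => s.replace(/^```(?:json)?\s*|\s*```$/g, ""), // strip markdown fences
  (s) => s.replace(/,\s*([}\]])/g, "$1"),            // drop trailing commas
  (s) => s.slice(0, s.lastIndexOf("}") + 1),          // cut garbage tail
];

function parseWithRetry(raw: string): unknown {
  let text = raw.trim();
  // Try the text as-is first, then retry after each repair pass.
  for (let i = 0; i <= repairs.length; i++) {
    try {
      return JSON.parse(text);
    } catch {
      if (i === repairs.length) throw new Error("unparseable tool call");
      text = repairs[i](text).trim();
    }
  }
  throw new Error("unreachable");
}
```

The key design point is that repairs are ordered from least to most destructive, and the loop is bounded — so well-formed output pays almost nothing and pathological output fails fast instead of looping.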
Source: docs/benchmarks/stagepilot-latest.json — 60 cases, 30 mutation modes.
| Strategy | Success | Rate | Avg Latency | P95 Latency | Avg Attempts |
|---|---|---|---|---|---|
| baseline | 10 / 40 | 25.00% | 0.02 ms | 0.05 ms | 1.00 |
| middleware | 26 / 40 | 65.00% | 0.13 ms | 0.39 ms | 1.00 |
| middleware+ralph-loop | 36 / 40 | 90.00% | 0.06 ms | 0.10 ms | 1.35 |
Each mode simulates a real-world LLM output failure pattern:
| # | Mode | What it tests |
|---|---|---|
| 1 | `strict` | Well-formed JSON baseline |
| 2 | `relaxed-json` | Unquoted keys, single quotes |
| 3 | `coercible-types` | String ↔ number type mismatches |
| 4 | `missing-brace` | Truncated JSON (missing closing brace) |
| 5 | `garbage-tail` | Extra tokens after valid JSON |
| 6 | `no-tags` | JSON without `<tool_call>` wrapper |
| 7 | `prefixed-valid` | Prose text before/after tool call |
| 8 | `deeply-nested-args` | 6 levels of nesting |
| 9 | `unicode-in-values` | Non-ASCII / emoji in values |
| 10 | `oversized-payload` | 12KB+ payload exceeding limits |
| 11 | `trailing-comma-json` | Trailing commas in JSON |
| 12 | `json-in-xml-wrapper` | Double-wrapped format |
| 13 | `concurrent-tool-calls` | Multiple tool calls in one response |
| 14 | `empty-arguments` | Correct name, empty args |
| 15 | `backreference-placeholder` | Template variables `{{...}}` |
| 16 | `adversarial-injection` | Prompt injection in values |
| 17 | `wrong-tool-name` | Hallucinated tool name |
| 18 | `truncated-json` | Network cutoff mid-value |
| 19 | `html-escaped-payload` | HTML entity encoding |
| 20 | `double-encoded-json` | `JSON.stringify()` applied twice |
| 21 | `markdown-fenced` | Tool call inside a ```` ```json ```` code block |
| 22 | `yaml-body` | YAML body instead of JSON |
| 23 | `mixed-quotes` | Mixed single/double quotes |
| 24 | `comment-in-json` | JSON with `//` comments |
| 25 | `bom-prefix` | UTF-8 BOM before content |
| 26 | `null-bytes` | Null bytes in strings |
| 27 | `reversed-key-order` | `arguments` before `name` in JSON |
| 28 | `multiline-values` | Embedded newlines in values |
| 29 | `partial-schema` | Some required fields missing |
| 30 | `xml-attribute-style` | Tool call as XML attributes |
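For intuition, a few of these mutations are simple string transforms over a well-formed payload. The sketch below is illustrative only — the benchmark's actual generator is deterministic and seeded (see ADR-003), and covers all 30 modes.

```typescript
// Illustrative versions of three mutation modes applied to a
// well-formed tool-call payload.
const wellFormed = '{"name":"get_weather","arguments":{"city":"Seoul"}}';

const mutations: Record<string, (s: string) => string> = {
  // markdown-fenced: wrap the call in a ```json code block
  "markdown-fenced": (s) => "```json\n" + s + "\n```",
  // garbage-tail: extra conversational tokens after valid JSON
  "garbage-tail": (s) => s + " Sure, let me know if you need anything else!",
  // trailing-comma-json: insert a comma before the final closing brace
  "trailing-comma-json": (s) => s.replace(/}$/, ",}"),
};
```

All three produce output that strict `JSON.parse` rejects, which is exactly the gap the middleware's repair layer is measured against.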
```bash
pnpm add @ai-sdk-tool/parser
```

```ts
import { morphXmlToolMiddleware } from "@ai-sdk-tool/parser";
import { wrapLanguageModel, streamText } from "ai";
import { z } from "zod";

// Works with any AI SDK provider: OpenAI, Anthropic, Google, Ollama, etc.
const enhanced = wrapLanguageModel({
  model: anyModel,
  middleware: morphXmlToolMiddleware,
});

const result = await streamText({
  model: enhanced,
  prompt: "What is the weather in Seoul?",
  tools: {
    get_weather: {
      description: "Get weather for a city",
      parameters: z.object({ city: z.string() }),
      execute: async ({ city }) => `${city}: 22°C, sunny`,
    },
  },
});
```

```bash
git clone https://github.com/KIM3310/stage-pilot.git
cd stage-pilot
pnpm install
pnpm api:stagepilot
# → http://127.0.0.1:8080/demo
```

| Middleware | Best for | Example models |
|---|---|---|
| `hermesToolMiddleware` | JSON-style tool payloads | Hermes, Llama |
| `morphXmlToolMiddleware` | XML + schema-aware coercion | Claude, GPT |
| `yamlXmlToolMiddleware` | XML tags + YAML bodies | Mixtral |
| `qwen3CoderToolMiddleware` | `<tool_call>` markup | Qwen, UI-TARS |
```bash
pnpm api:stagepilot   # http://127.0.0.1:8080
```

| Endpoint | Method | What it does |
|---|---|---|
| `/v1/plan` | POST | Run a case through the 5-stage pipeline |
| `/v1/benchmark` | POST | Run the full benchmark suite |
| `/v1/insights` | POST | Narrative insights from benchmark data |
| `/v1/whatif` | POST | What-if simulation for staffing/demand |
| `/v1/metrics` | GET | Prometheus metrics (scrape-ready) |
| `/health` | GET | Health check (K8s probes) |
| `/demo` | GET | Interactive demo UI |
**Docker**

```bash
docker build -t stagepilot-api .
docker run -p 8080:8080 -e GEMINI_API_KEY="$GEMINI_API_KEY" stagepilot-api
```

**GCP Cloud Run (one command)**

```bash
pnpm deploy:stagepilot
```

Infrastructure managed by Terraform:

```bash
cd infra/terraform
terraform init && terraform apply
```

**Kubernetes (production)**

```bash
kubectl create namespace stagepilot
kubectl create secret generic stagepilot-secrets \
  --namespace stagepilot \
  --from-literal=gemini-api-key="$GEMINI_API_KEY"
kubectl apply -f infra/k8s/
# Includes: Deployment (2 replicas), Service, HPA (2-10 pods),
# ConfigMap, liveness/readiness/startup probes
```

**Vercel / Cloudflare Workers**

See `vercel.json` and `wrangler.toml` in the repo root.
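For orientation, an HPA matching the "2-10 pods" note above might look like the following. This is a sketch, not the shipped manifest — the actual file lives in `infra/k8s/`, and everything here beyond the replica bounds (names, the CPU target of 70%) is an assumption.

```yaml
# Sketch of a HorizontalPodAutoscaler scaling the API between 2 and 10 pods.
# Resource names and the CPU utilization target are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: stagepilot-api
  namespace: stagepilot
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: stagepilot-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```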
```
┌─────────────────────────────────────────────────────┐
│                  StagePilot API                     │
│                                                     │
│  ┌──────────┐  ┌──────────┐  ┌──────────────────┐   │
│  │ OTel SDK │  │Prometheus│  │  Datadog Agent   │   │
│  │  Spans   │  │ Counters │  │   (optional)     │   │
│  └────┬─────┘  └────┬─────┘  └────────┬─────────┘   │
│       │             │                 │             │
└───────┼─────────────┼─────────────────┼─────────────┘
        │             │                 │
   ┌────▼────┐  ┌─────▼─────┐    ┌──────▼──────┐
   │ Jaeger  │  │  Grafana  │    │   Datadog   │
   │ Zipkin  │  │ Dashboard │    │  Dashboard  │
   └─────────┘  └───────────┘    └─────────────┘
```
- OpenTelemetry: Per-stage spans (`eligibility_span`, `safety_span`, `planner_span`, `outreach_span`, `judge_span`, `parse_span`)
- Prometheus: `toolCallsTotal` counter + `toolCallParseDuration` histogram, scraped via `/v1/metrics`
- Datadog: Pre-built dashboard + monitor configs in `docs/datadog/`
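The Prometheus half boils down to a counter plus a latency histogram with cumulative buckets. The dependency-free sketch below shows the histogram mechanics in the spirit of `toolCallParseDuration` — the class name, bucket boundaries, and metric name are invented for illustration, and the real instrumentation lives in `src/telemetry/`.

```typescript
// Minimal latency histogram: cumulative buckets (as Prometheus expects),
// plus a running sum and count, rendered in text exposition format.
const BUCKETS_MS = [0.05, 0.1, 0.5, 1, 5]; // upper bounds; +Inf is implicit

class ParseDurationHistogram {
  private counts: number[] = new Array(BUCKETS_MS.length + 1).fill(0);
  private sum = 0;
  private total = 0;

  observe(ms: number): void {
    this.sum += ms;
    this.total++;
    const i = BUCKETS_MS.findIndex((le) => ms <= le);
    this.counts[i === -1 ? BUCKETS_MS.length : i]++;
  }

  render(name = "tool_call_parse_duration_ms"): string {
    const lines: string[] = [];
    let cum = 0;
    BUCKETS_MS.forEach((le, i) => {
      cum += this.counts[i];
      lines.push(`${name}_bucket{le="${le}"} ${cum}`);
    });
    lines.push(`${name}_bucket{le="+Inf"} ${this.total}`);
    lines.push(`${name}_sum ${this.sum}`);
    lines.push(`${name}_count ${this.total}`);
    return lines.join("\n");
  }
}
```

Cumulative buckets are what make P95 queries cheap on the Grafana side: `histogram_quantile` only needs the per-bucket counts, never the raw samples.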
```
src/
  adapters/        # AWS S3/CloudWatch, GCP integrations
  api/             # HTTP server, Prometheus metrics, sessions
  bin/             # CLI entry points (stagepilot-api, benchlab-api)
  community/       # Community protocols (Sijawara, UI-TARS)
  core/            # Parser protocols (8 variants), prompts, utils
  rjson/           # Relaxed JSON parser with repair heuristics
  rxml/            # Relaxed XML parser with tokenizer + schema extraction
  schema-coerce/   # Type coercion engine
  stagepilot/      # 5-agent orchestrator, benchmark, insights, twin
  telemetry/       # OpenTelemetry + Prometheus instrumentation
  __tests__/       # ~174 unit test files
  tests/           # ~13 integration test files
infra/
  k8s/             # Deployment, Service, HPA, ConfigMap
  terraform/       # GCP Cloud Run provisioning
docs/
  adr/             # Architecture Decision Records
  benchmarks/      # Benchmark artifacts + reports
  benchlab/        # BFCL experiment docs
  datadog/         # Dashboard + monitor configs
experiments/       # 5 BFCL experiment variants (Claude, Gemini, Grok, Kiro, OpenAI-compat)
scripts/           # Build, deploy, load-test (k6)
.github/workflows/ # CI/CD pipelines
```
| ADR | Title | Summary |
|---|---|---|
| ADR-001 | Stage-Gated Pipeline | Why 5 sequential agents instead of single-pass. Each stage isolates a concern, emits OTel spans, enables independent model selection. |
| ADR-002 | Parser as AI SDK Middleware | Why middleware pattern over custom wrapper or post-processing. Provider-agnostic, composable, own npm lifecycle. |
| ADR-003 | Benchmark Methodology | Why deterministic seeded cases with 30 mutation modes. Reproducible, captures real-world failure patterns, separates format issues from model understanding gaps. |
| Category | Technologies |
|---|---|
| Language | TypeScript 5.9, Node.js 20 |
| AI SDK | Vercel AI SDK 6.0, Zod 4.3 |
| Parsing | Custom RJSON + RXML engines, 8 protocol variants |
| Observability | OpenTelemetry (spans), Prometheus (metrics), Datadog (dashboards) |
| Infrastructure | Docker, Kubernetes (HPA), Terraform, GCP Cloud Run |
| Deployment | GCP Cloud Run, Vercel, Cloudflare Workers |
| Cloud | AWS (S3, CloudWatch), GCP (Cloud Run, Secret Manager) |
| Testing | Vitest, ~187 test files, v8 coverage |
| CI/CD | GitHub Actions |
- npm: @ai-sdk-tool/parser
- Demo: YouTube
- Blog: How We Raised Tool-Calling Success from 25% to 90% (Korean) / English
- Based on: minpeter/ai-sdk-tool-call-middleware
- Related: tool-call-finetune-lab — Fine-tuning approach for the remaining 10% gap
Apache-2.0