Michael McCarty, Sovereign Labs March 2026
AI agents running in tool-calling loops systematically repeat failed strategies. We present narrowing, a constraint-learning runtime that extracts failure signatures from execution outcomes and structurally removes failed strategies from the agent's action space. Unlike prompt-based memory (which degrades under context window compression), narrowing constraints persist across sessions and process restarts as structural guardrails enforced before execution — the agent never gets to retry what's banned.
We evaluate narrowing in two domains: (1) LLM-guided hyperparameter search on GPU, where we measure constraint activation against a frontier model (Gemini 2.5 Flash), and (2) agent tool-calling loops, where we address the documented failure patterns that have cost production users hundreds of dollars and hundreds of gigabytes.
Our GPU benchmarks reveal a nuanced finding: frontier LLMs self-correct on simple failure boundaries within 1-2 trials, compressing narrowing's within-session value. However, every new session rediscovers the same failures from scratch — cross-session constraint persistence eliminates this redundancy entirely. The tool-call adapter addresses the more immediate market need: preventing the infinite loops, repeated reads, and runaway resource creation documented in VS Code Copilot, Kilo Code, n8n, and Claude Code.
The runtime is 2,200 LOC of zero-dependency TypeScript with domain adapters, tamper-evident receipt chains, and convergence detection. Published as @sovereign-labs/narrowing under MIT license.
When an AI agent calls tools in a loop — editing files, running commands, querying APIs — it occasionally fails. The architecturally interesting question is: what happens next?
In production, the answer is often: it tries the same thing again.
This isn't a capability failure. Frontier LLMs can reason about errors when the failure is visible in their context window. The problem is architectural:
-
Context window compression. Long-running agents periodically summarize their conversation history to fit within token limits. This summarization erases the specific details of prior failures — file paths, error messages, exact parameter values. The agent genuinely doesn't know it already tried this approach.
-
Session boundaries. When an agent process restarts (crash recovery, new session, handoff), all runtime memory is lost. The next session rediscovers every failure from scratch.
-
No structural enforcement. Even when failure information is technically present in the context, the LLM may choose to retry the same approach. Prompt-based memory is advisory — the model can and does ignore it.
These are not hypothetical concerns. The following incidents are documented in public issue trackers:
| System | Incident | Impact | Reference |
|---|---|---|---|
| VS Code Copilot | Background agent created 1,526 git worktrees in 16 hours | ~800GB disk consumed | microsoft/vscode#296194 |
| Kilo Code | Agent read the same file 1,000 times in a loop | $7.59 burned, 8.5M tokens | Kilo-Org/kilocode#3767 |
| n8n | AI agents stuck in infinite tool-call loops | ~50% occurrence rate | n8n-io/n8n#13525 |
| Claude Code | Infinite compaction → re-read → compaction cycle | Task completion blocked | anthropics/claude-code#6004 |
In each case, the agent repeated an action that had already failed or was demonstrably unproductive. No structural mechanism existed to prevent the repetition.
Narrowing is a constraint-learning runtime that sits between the agent's proposal generator (the LLM) and the execution environment. It provides three APIs:
checkProposal()— Before execution: is this proposal allowed given current constraints?recordOutcome()— After execution: extract failure signature, classify blame, seed constraints if corroborated.isDone()— Has the agent exhausted its viable search space?
The key design decisions:
-
Structural enforcement, not advisory. Constraints are checked by the runtime, not suggested to the LLM. A banned strategy cannot be attempted regardless of what the model generates.
-
Blame classification. Infrastructure faults (timeouts, rate limits, permission errors) never seed constraints. Only agent-attributable failures narrow the search space. This prevents the "poisoned well" where environmental noise progressively eliminates all valid strategies.
-
Corroboration threshold. A single failure does not seed a constraint. By default, a failure signature must be observed twice before generating a structural ban. This prevents over-reaction to transient errors.
-
Domain adapters. The core runtime is domain-agnostic. Adapters translate domain-specific signals (GPU OOM, file-not-found, edit-failed) into the universal constraint language.
Agent Proposal Generator (LLM)
↓ proposal
NarrowingLoop
├── ConstraintStore.checkProposal()
│ ├── Strategy ban: Is this action class banned?
│ ├── Radius limit: Too many targets?
│ └── Parameter ban: Is this specific value banned?
│
│ → { allowed: false, violations } → reject, feed back to LLM
│ → { allowed: true } → proceed to execution
│
↓ execution
↓ outcome
NarrowingLoop.recordOutcome()
├── Adapter.extractSignature() Regex-based, deterministic
├── Adapter.classifyBlame() agent_failure | harness_fault | unknown
├── Adapter.classifyAction() Action class for strategy bans
├── ConstraintStore.seedFromOutcome() Learn if corroborated
├── ConvergenceTracker.update() Progress detection
├── Journal.append() Event log
└── ReceiptChain.append() Tamper-evident audit trail
| Type | Trigger | Effect | Example |
|---|---|---|---|
banned_strategy |
Same action class + failure signature, 2+ times | Blocks any proposal with that action class | "file_edit with edit_failed: banned" |
radius_limit |
Progressive failures | Shrinks max allowed targets: ∞ → 5 → 3 → 2 → 1 | Forces smaller, safer changes |
parameter_ban |
Specific value failed 2+ times | Blocks proposals with that exact parameter value | n_embd=1024 caused OOM twice |
Signatures are extracted deterministically via regex pattern matching — no LLM inference required. Each domain adapter provides signature patterns ordered by priority (first match wins).
The tool-call adapter provides 12 signatures:
| Signature | Pattern | Blame |
|---|---|---|
tool_timeout |
timeout|ETIMEDOUT|deadline exceeded |
harness_fault |
tool_not_found |
tool not found|unknown tool |
harness_fault |
permission_denied |
permission denied|EACCES|403 |
harness_fault |
rate_limited |
429|rate.?limit|too many requests |
harness_fault |
file_not_found |
ENOENT|no such file|file not found |
agent_failure |
syntax_error |
SyntaxError|parse error|invalid JSON |
agent_failure |
edit_failed |
search string not found|edit.*failed |
agent_failure |
command_failed |
exit code [1-9]|command failed |
agent_failure |
validation_error |
validation failed|invalid.*argument |
agent_failure |
conflict |
409|conflict|already exists |
agent_failure |
empty_result |
no results|empty response|null |
agent_failure |
api_error |
500|502|503|internal server error |
unknown |
Infrastructure faults (harness_fault) are recorded but never seed constraints. This distinction is critical — without it, a flaky network connection would progressively ban every API call the agent attempts.
The ConvergenceTracker monitors whether the agent is making progress:
- Progressing: Scores improving or new strategies being explored
- Plateau: No score improvement within a configurable window
- Exhausted: Every viable strategy has been tried or banned
isDone() returns true when the tracker detects exhaustion — the search space has been narrowed to the point where no valid proposals remain. This enables graceful termination instead of running until a timeout.
Two persistence mechanisms:
Journal — Append-only event log (narrowing.jsonl). Records every outcome, constraint seed, and proposal check. Enables post-hoc analysis and replay.
Receipt chain — Tamper-evident hash chain where each receipt includes hash = sha256(previousHash + canonicalPayload). Modification of any receipt invalidates all subsequent hashes. Enables auditing of the constraint-learning process.
Cross-session persistence via snapshot() / restore():
// End of session
const state = loop.snapshot();
fs.writeFileSync('narrowing-state.json', JSON.stringify(state));
// Start of next session
const saved = JSON.parse(fs.readFileSync('narrowing-state.json'));
loop.restore(saved);
// Constraints from prior sessions are immediately activeWe evaluate narrowing in Karpathy's autoresearch framework — an automated ML research loop where an LLM proposes hyperparameter configurations, trains a character-level language model, and iterates based on results.
Task: Train a nanoGPT-style transformer on the Shakespeare corpus. Score: validation bits-per-byte (bpb, lower is better). Training budget: 90 seconds per trial.
Hardware: NVIDIA L4 GPU (24GB VRAM), GCP g2-standard-4.
LLM: Gemini 2.5 Flash (temperature=0.3, thinkingBudget=0). The LLM sees the full history of prior trials and proposes the next configuration.
Search space: Three parameters — depth (transformer layers), aspect_ratio (width-to-depth ratio), matrix_lr (learning rate scale).
Narrowing adapter: ML training adapter with 13 failure signatures and 8 action classes.
| Version | Configuration | Purpose |
|---|---|---|
| v5 | L4, bs=16, 3 seeds × 15 trials | Baseline: LLM integration validation |
| v6 | L4, bs=32, 5 seeds × 25 trials | Primary: constraint activation test |
Why two versions: v5 established that the LLM integration works end-to-end but revealed that the L4's 24GB VRAM was too generous — Gemini's conservative proposals (depth 3-10) never approached the OOM boundary (~depth 14-16). Zero constraints were seeded.
v6 shifts the OOM boundary closer by increasing batch size from 16 to 32, which increases per-step memory consumption. (Batch size 64 was calibrated first but caused 100% OOM with LLM-guided proposals — Gemini's natural range of depth 8-32 all exceeded the bs=64 boundary. Batch size 32 places the boundary in the middle of the proposal space.) At bs=32, depth=8 OOMs — exactly where Gemini naturally starts proposing. Both vanilla and narrowing modes use the same batch size, so the confound affects both equally.
Protocol: 5 seeds × 2 modes × 25 trials = 250 training runs. ~10 hours wall time on L4.
| Metric | Vanilla | Narrowing | Delta |
|---|---|---|---|
| Successful trials | 113/125 | 102/125 | -11 |
| OOM failures | 10 | 10 | 0 |
| Blocked proposals | 0 | 12 | +12 |
| Wasted trials (OOM + timeout) | 12 | 23 | +11 |
| Best score (bpb) | 1.891 | 1.866 | -0.025 |
| Average score | 1.689 | 1.747 | +0.058 |
Every seed, both modes, produced an identical pattern:
- Trial 1: Baseline (deterministic)
- Trial 2: Gemini proposes depth=8+ → OOM
- Trial 3 vanilla: Gemini retries a similar config → OOM again (repeated failure)
- Trial 3 narrowing: Constraint already seeded → proposal blocked → no GPU waste
- Trials 4-25: Gemini self-corrects, stays below the boundary in both modes
This pattern was perfectly consistent across all 5 seeds.
Mechanism activation: YES. 12 blocked proposals across 5 seeds. Constraints seeded from real OOM failures. Every block corresponded to a configuration that would have exceeded the GPU memory boundary.
Within-session value: MARGINAL. Gemini 2.5 Flash adapts to OOM failures within 1 trial. The 12 blocked proposals saved ~360 seconds of GPU time (12 × ~30s per OOM), but Gemini would have self-corrected on the very next trial regardless.
Cross-session value: CLEAR. Without narrowing, every new seed starts fresh — trial 2 always hits the same OOM boundary. With persistent constraint stores, subsequent sessions would start with the boundary already known. The 10 OOMs that occurred identically across all 5 vanilla seeds would be prevented entirely in cross-session mode.
Score paradox: Vanilla achieved the best single score (1.891) because unconstrained exploration occasionally lands on near-optimal aggressive configurations. Narrowing's higher average (1.747 vs 1.689) reflects that constraints keep the LLM in productive regions. This suggests a refinement: confidence-calibrated constraints that loosen over time.
v6 demonstrates that narrowing's constraint mechanism activates correctly against real GPU failures and a frontier LLM. However, the within-session benefit is small because Gemini 2.5 Flash is smart enough to learn from a single OOM.
Narrowing's value scales with:
- Horizon length — More trials means more opportunities for context window compression to erase failure evidence
- Failure complexity — Simple boundaries (single OOM threshold) are easy for frontier models. Multi-dimensional failure surfaces (tool interaction effects, cascading errors) are harder
- Model capability — Weaker models (local Ollama, smaller GPT variants) benefit more from structural constraints
- Session count — Cross-session persistence provides value proportional to how often the agent restarts
The tool-call adapter is the primary integration surface for agent framework developers. It classifies any tool call into:
Action classes (7): file_read, file_edit, file_create, shell_exec, search, api_call, delete
Classification uses normalized tool name matching. Tool names in any convention (snake_case, camelCase, dash-case) are normalized to space-separated tokens before word-boundary regex matching. When the tool name is ambiguous, parameter shape provides fallback classification (e.g., presence of command key → shell_exec).
Failure signatures (12): Regex-based extraction from error messages, ordered by priority. Each signature has a default blame classification that can be overridden by context.
Fingerprinting: Structural identity of a tool call — what tool, what target, what pattern. Used for exact-match constraint checking. Not a hash of full arguments (too brittle), but an extraction of the structural parameters that define "same call."
import { NarrowingLoop } from '@sovereign-labs/narrowing';
import { createToolCallAdapter, toolCallToProposal, toolCallToOutcome }
from '@sovereign-labs/narrowing/adapters/tool-call';
const loop = new NarrowingLoop({ adapter: createToolCallAdapter() });
// Before every tool call
const check = loop.checkProposal(
toolCallToProposal(toolName, args)
);
if (!check.allowed) {
feedbackToLLM(check.violations[0].reason);
continue;
}
// After tool call
loop.recordOutcome(toolCallToOutcome(toolName, args, result));Reproduced in our test suite:
- Agent calls
read_fileondata.json→ empty result - Agent calls
read_fileondata.jsonagain → same empty result (corroboration) - Constraint seeded:
file_readstrategy withempty_resultsignature - Third
read_fileondata.json→ BLOCKED
In production (Kilo Code), this loop ran 1,000 times, consuming 8.5M tokens ($7.59). With narrowing, it would have been blocked on attempt 3.
Critical to the design: infrastructure faults never seed constraints.
If an API endpoint returns 429 (rate limited) twice, narrowing does NOT ban the api_call strategy. Rate limiting is the server's problem, not the agent's mistake. Banning API calls after rate limits would be the "poisoned well" — environmental noise progressively eliminating every valid strategy until the agent has no moves left.
The blame classifier distinguishes:
- Agent failures (file not found, syntax error, edit failed) → constraints seeded after corroboration
- Harness faults (timeout, rate limit, permission denied) → recorded, never constraining
- Unknown (5xx errors) → recorded, conservative handling
The strongest argument for narrowing is at long horizons where context compression erases failure evidence. We built a controlled benchmark that directly measures this effect.
Setup: A simulated agent makes 200 tool calls against a mock codebase (6 files, deterministic success/failure). The agent has a configurable "memory window" — it only remembers the last N outcomes. When a failure scrolls out of the window, the agent has a 30% probability of replaying the same failing call. This models the documented behavior in production agents (context compaction erasing failure evidence).
No LLM needed. No GPU. Deterministic via Mulberry32 PRNG. Runs in <1 second.
Results (seed=42, 200 calls):
| Memory Window | Vanilla Repeats | Narrowing Repeats | Blocked | Wasted Cost |
|---|---|---|---|---|
| 10 (aggressive) | 90 | 3 | 143 | $0.318 → $0.067 |
| 20 (typical) | 82 | 2 | 135 | $0.297 → $0.071 |
| 50 (generous) | 71 | 2 | 115 | $0.261 → $0.065 |
Degradation curve (window=20):
| Horizon | Vanilla Repeats | Narrowing Repeats | Delta |
|---|---|---|---|
| 50 calls | 10 | 1 | +9 |
| 100 calls | 34 | 2 | +32 |
| 150 calls | 58 | 2 | +56 |
| 200 calls | 82 | 2 | +80 |
Vanilla repeats grow linearly with horizon length — every time a failure scrolls out of the context window, the agent risks replaying it. Narrowing repeats flatline at 2 because constraints are structural (outside the context window) and survive compression.
Multi-seed stability: Narrowing wins on 5/5 seeds (seeds 1-5).
Cross-session persistence: Session 2 restored from session 1's snapshot immediately blocks the same calls — zero rediscovery. Infrastructure faults (10 consecutive timeouts) correctly produce zero constraints.
Cost projection at Kilo Code scale: 215 saved calls per 200-call session. At $0.003/call, $193.50/month savings for a 10-session/day workload.
This benchmark addresses the limitation identified in §3.6 — within-session value is marginal for smart models on simple boundaries, but context window degradation is not a model capability problem. It is an architecture problem. Narrowing solves it structurally.
LangChain Memory and CrewAI Long-Term Memory store conversation history and retrieved context. These are content memories — they inform the LLM's reasoning but don't structurally prevent actions. The LLM can choose to ignore them.
Reflexion (Shinn et al., 2023) generates verbal self-reflections after failures. These reflections are re-injected into the prompt on the next attempt. Like narrowing, this creates a feedback loop from failures. Unlike narrowing, the reflections are advisory (the LLM can ignore them) and degrade under context compression.
Voyager (Wang et al., 2023) builds a skill library from successful executions in Minecraft. This is complementary to narrowing — Voyager remembers what works, narrowing remembers what doesn't.
Tabu search (Glover, 1986) maintains a list of recently visited solutions and forbids returning to them. Narrowing's banned_strategy constraint is conceptually similar, but operates on action classes rather than specific solutions, and uses failure signatures rather than visit recency for constraint generation.
Novelty search (Lehman & Stanley, 2011) rewards behavioral novelty to avoid convergence to local optima. Narrowing's convergence detection serves a similar purpose — detecting when the agent is spinning rather than exploring.
No existing system provides all three properties simultaneously:
- Structural enforcement (not advisory)
- Blame-aware learning (infrastructure faults don't poison the constraint store)
- Cross-session persistence (constraints survive context compression and process restarts)
Simple failure boundaries. v6 demonstrates that frontier LLMs self-correct on single-dimensional failure boundaries (GPU memory) within 1-2 trials. Narrowing's within-session value is marginal in this regime. The value proposition is stronger for multi-dimensional failure surfaces and longer horizons where context compression erases failure evidence.
Corroboration delay. The default threshold of 2 means the first failure always executes. For expensive operations (GPU training, cloud API calls), even one wasted execution has real cost. Configurable thresholds trade off between false positives (over-constraining from noise) and wasted executions.
Action class granularity. The tool-call adapter classifies at the tool-name level (file_edit), not at the strategy level ("rewrite the entire file" vs "patch one line"). Banning file_edit after two failures is potentially too broad. Finer-grained action classes would improve precision.
Cross-agent constraint sharing. Multiple agents working on related tasks could share constraint stores with scope-aware filtering. A constraint learned on one GPU configuration shouldn't automatically apply to a different GPU — but a constraint about a specific file being read-only should propagate immediately. The v6 dataset includes context fields to enable post-hoc simulation of scoped sharing (planned as v7).
Confidence-calibrated constraints. v6's score paradox (narrowing produces higher average but lower best score) suggests that permanent strategy bans are too aggressive. Constraints that weaken over time would allow re-exploration of boundary regions after the agent has established a baseline in safer territory.
Live LLM benchmarks with real context compression. §4.5 demonstrates the effect with simulated context windows. Validating this against live LLM agents (Claude Code, Cursor, LangGraph) where real context compaction occurs would strengthen the empirical case. The simulated 30% replay probability is based on production incident reports, but direct measurement of replay rates under actual compaction would calibrate the model.
Narrowing addresses an architectural gap in AI agent systems: the absence of persistent, structural failure memory. Frontier LLMs are capable of learning from failures when those failures are visible in their context window — but context windows compress, sessions restart, and advisory memory gets ignored.
The GPU benchmark (v6) demonstrates that the constraint mechanism activates correctly and that cross-session persistence provides clear value, even when within-session value is marginal against a smart model. The long-horizon tool-call benchmark (§4.5) proves the complementary point: context window degradation causes vanilla agents to repeat failures at linearly increasing rates, while narrowing holds flat regardless of horizon length. The tool-call adapter addresses the immediate production need: preventing the documented infinite loops, repeated reads, and runaway resource creation in shipping agent systems.
The runtime is small (2,200 LOC), zero-dependency, and designed for integration into any agent loop with three function calls. The domain adapter architecture ensures extensibility without core changes.
- Package:
@sovereign-labs/narrowing - License: MIT
- Runtime: Zero dependencies, pure TypeScript
- Tests: 72 tests, 182 assertions (core 26, tool-call adapter 38, long-horizon benchmark 8)
- Adapters: ML training (13 signatures, 8 action classes), tool-call (12 signatures, 7 action classes)
- Source: github.com/Born14/mcp-proxy (packages/narrowing)
v5 (L4, bs=16, 3 seeds × 15 trials) established that Gemini's proposals stay well within the L4's memory budget at batch size 16. Zero OOMs occurred. Zero constraints were seeded. This calibrated the boundary — the OOM wall was too far away for Gemini's conservative search to reach it. Full analysis: docs/narrowing/v5-analysis.md.
| Seed | Mode | OK | OOM | Timeout | Blocked | Best Score | Avg Score |
|---|---|---|---|---|---|---|---|
| 1 | vanilla | 22 | 2 | 1 | 0 | 1.815 | 1.609 |
| 1 | narrowing | 22 | 2 | 0 | 1 | 1.851 | 1.769 |
| 2 | vanilla | 23 | 2 | 0 | 0 | 1.891 | 1.724 |
| 2 | narrowing | 21 | 2 | 0 | 2 | 1.851 | 1.780 |
| 3 | vanilla | 23 | 2 | 0 | 0 | 1.886 | 1.658 |
| 3 | narrowing | 19 | 2 | 0 | 4 | 1.866 | 1.785 |
| 4 | vanilla | 22 | 2 | 1 | 0 | 1.890 | 1.803 |
| 4 | narrowing | 18 | 2 | 1 | 4 | 1.866 | 1.778 |
| 5 | vanilla | 23 | 2 | 0 | 0 | 1.891 | 1.653 |
| 5 | narrowing | 22 | 2 | 0 | 1 | 1.865 | 1.636 |