Skip to content

romamo/cli-agent-spec

Repository files navigation

CLI Agent Spec

Your CLI tool works perfectly for humans. For AI agents, it silently hangs, corrupts data, leaks secrets, and exhausts context windows — and you would never know.

This is a specification for building CLI tools that AI agents can call reliably: 66 documented failure modes, 135 requirements to eliminate them, and machine-readable schemas an agent can consume directly.

No existing CLI framework covers more than 58% of these challenges.


What's going wrong right now

AI agents call CLI tools constantly — to deploy infrastructure, query APIs, manage files, run pipelines. Most tools were never designed for this. Here is what agents actually encounter:

# Agent calls a list command. The tool pages output and waits for keypress.
# The agent never receives a response. The pipeline stalls. Forever.
$ kubectl get pods   # opens less, waits for input

# Agent deploys to staging. The command times out at 30s, returns exit 1.
# exit 1 means "error" — but does it mean "nothing happened" or "half-deployed"?
# The agent retries. Now it's deployed twice.
$ deploy --env staging   # exit 1 — but why? safe to retry?

# Agent reads a list of users. One username contains an emoji.
# The JSON serializer crashes on non-ASCII. The agent gets no output, no error.
$ tool users list   # silent failure on emoji in username

# Agent passes a flag after the subcommand — natural LLM ordering.
# The parser silently treats --output as a positional argument value.
# The agent receives plain text it can't parse. Exit code: 0.
$ tool list users --output json   # parsed as: list "users" "--output" "json"

These are not edge cases. They are the default behavior of most CLI tools today — including tools from major companies. The cost falls on the agent: wasted tokens, stalled pipelines, data corruption from blind retries, cascading failures with no root cause.


What this spec defines

66 failure modes — each documented with severity, frequency, detectability, token cost, time cost, and context cost from the agent's perspective. Grouped into 7 parts: ecosystem/runtime, execution, security, output, environment, errors, and observability.

135 requirements across 3 tiers:

Tier Count Who implements it
F — Framework-Automatic 67 The framework enforces it; command authors get it for free
C — Command Contract 27 Command authors declare it at registration
O — Opt-In 41 Applications enable it explicitly

4 JSON schemas — machine-readable type definitions for exit codes, response envelopes, tool manifests, and error details. Generate typed structs for your language directly from the schemas.

A comparison matrix — 12 existing frameworks (argparse, Click, Cobra, Clap, Typer, Commander.js, and more) scored against all 66 challenges. No framework exceeds 58%.


The three contracts that matter most

Exit codes — 14 named codes (0–13) with machine-readable guarantees per code: retryable: true/false, side_effects: "none" | "partial" | "complete". An agent receiving exit 11 (CONFLICT) knows the operation is safe to retry. Receiving exit 6 (PARTIAL_FAILURE) knows it must inspect state before retrying. See exit-code.json.

Response envelope — every command wraps its output in { ok, data, error, warnings, meta }. The same keys are always present. Agents never parse free-text to determine success or failure. See response-envelope.json.

Tool manifesttool manifest --output json returns the complete command tree: every subcommand, flag, type, description, exit code map, and example. One call replaces O(N) --help iterations and eliminates trial-and-error argument discovery. See manifest-response.json.


What's in this repo

Path Contents
challenges/ 66 failure modes, each with problem, impact, solutions, 0–3 evaluation rubric, and agent workaround
requirements/ 135 requirements with acceptance criteria, wire format, and examples
schemas/ JSON Schema draft-07 definitions for all 4 types
IMPLEMENTING.md Implementation guide: wave-based order, goal-based paths, invariants, codegen
comparison-matrix.md 66 challenges × 12 frameworks coverage table
research/ Per-framework analysis and competitive landscape (MCP, OpenAPI, function calling)
skills/ Agent skills for evaluating CLIs and guiding implementation

Start here

I want to understand the problemchallenges/index.md — browse by severity. Start with §10 (interactive blocking), §43 (output size), §50 (stdin deadlock), §62 (editor trap).

I want to implement this in my frameworkIMPLEMENTING.md — wave-based implementation order, or pick a goal-based path:

I want to evaluate my existing CLI → use the agent skills below, or read challenges/checklist.md for a self-assessment.

I want to add a challenge or requirementAGENTS.md


Agent skills

Three installable skills for Agent Skills-compatible agents (Claude Code, Cursor, Gemini CLI, Copilot, and others):

Skill Purpose
cli-agent-onboard Profile a CLI tool once — detects runtime, binary, flags, timeout method
cli-agent-evaluate Score a CLI against a single challenge (0–3), with applicable agent workaround
cli-agent-implement Guide implementing the spec in a CLI framework, tier by tier
# Install (run inside your agent)
npx skills install romamo/cli-agent-spec/skills/cli-agent-onboard
npx skills install romamo/cli-agent-spec/skills/cli-agent-evaluate
npx skills install romamo/cli-agent-spec/skills/cli-agent-implement

Contributing

The spec is a living document. New challenges are added when a failure mode is confirmed against real tooling. New requirements follow from new challenges.

Before contributing, read AGENTS.md for conventions: file format, required sections, naming rules, and how to run /validate-links to verify cross-references after any edit.


CLI Agent Spec v1.5 — 66 challenges · 135 requirements · 4 schemas · 12 frameworks evaluated

About

A specification for building CLI tools that work reliably under AI agent orchestration

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors