A TypeScript starter kit for building calibrated, task-specific eval packs for AI systems.
Generic evals hide the failure surface. Structured Evals helps teams turn domain-shaped AI failures into benchmark artifacts: fixtures, criteria, slices, graders, failure modes, adjudication, calibration, and reports.
It is designed for AI systems where one generic score is too blunt and failures need to be named, sliced, adjudicated, and improved over time.
Use this repo if you are building an AI system where correctness is structured, domain-specific, and hard to capture with one generic score.
Examples:
- A UI generation system that must preserve hierarchy, roles, interactions, and design intent.
- An intent router that must distinguish similar user requests without over-routing.
- An agent workflow where failures need to be classified by root cause, not just pass/fail.
- A product team that needs benchmark artifacts reviewers can inspect and improve over time.
This repo is probably not what you want if you only need:
- a generic LLM-as-judge wrapper
- a hosted dashboard
- a pixel-diff visual regression tool
- a model benchmark leaderboard
- a one-off test script
- Create an `eval.manifest.yaml` that defines the task, criteria, slices, coverage, and limitations (a rough sketch of a manifest's shape follows this list).
- Use deterministic graders to produce failure modes, repair hints, slice summaries, and reports.
- Bring your own ScreenIR Lite output, exported signal output, or resolver adapter.
- Compare evaluator output against human adjudication to measure agreement and refine thresholds, severities, and failure relationships.
- Copy the templates and define your own fixture schema, criteria, failure modes, reports, and calibration set.
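The manifest schema itself lives in this repo's templates; purely as an illustration, a manifest parsed into TypeScript might look roughly like this (every field name below is an assumption, not the shipped schema):

```ts
// Hypothetical shape of a parsed eval.manifest.yaml -- field names are
// illustrative assumptions, not the schema defined by this repo's templates.
interface BenchmarkManifest {
  id: string;                                   // e.g. "ui.mobile-login"
  task: string;                                 // what the system under test must do
  criteria: { id: string; weight: number; description: string }[];
  slices: { id: string; fixtureIds: string[] }[];
  coverage: string;                             // what the fixtures exercise
  limitations: string[];                        // known blind spots of the benchmark
}

const manifest: BenchmarkManifest = {
  id: "ui.mobile-login",
  task: "Reconstruct a mobile login screen from a design reference",
  criteria: [
    { id: "hierarchy", weight: 0.4, description: "Container nesting preserved" },
    { id: "interaction", weight: 0.3, description: "Tap targets keep their roles" },
    { id: "token-intent", weight: 0.3, description: "Design tokens map to intent" },
  ],
  slices: [{ id: "auth-forms", fixtureIds: ["mobile-login-basic"] }],
  coverage: "Single-screen login flows only",
  limitations: ["No multi-step auth", "No localization fixtures"],
};
```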
```bash
npm install
npm run benchmarks:list
npm run cli -- run ui-reconstruction ui.mobile-login --candidate bad
npm run cli -- calibrate ui.mobile-login
```

Then point your own output at an existing benchmark:

```bash
npm run cli -- generate-report ui-reconstruction ui.mobile-login \
  --candidate-file ./my-output.json \
  --candidate-label my-system-v1 \
  --out ./tmp/mobile-login.md \
  --json-out ./tmp/mobile-login.json
```

The bad UI report shows how the same reconstruction can preserve some visible structure while losing hierarchy, interaction affordances, and token intent. The calibration report shows how evaluator output compares against reviewer adjudication.
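If you feed `--json-out` into CI, a gate can be as small as reading the run record and failing the build on regressions. The fields read below are assumptions about the record's shape, not the actual schema; check the generated JSON for the real keys:

```ts
// Minimal CI gate over a --json-out run record. The fields read here
// (score, failures, severity) are assumed for the sketch, not guaranteed
// by the real run-record schema.
import { readFileSync } from "node:fs";

interface RunRecord {
  benchmarkId: string;
  score: number;                                   // assumed aggregate in [0, 1]
  failures: { mode: string; severity: string }[];
}

const record = JSON.parse(
  readFileSync("./tmp/mobile-login.json", "utf8"),
) as RunRecord;

const blocking = record.failures.filter((f) => f.severity === "critical");
if (record.score < 0.8 || blocking.length > 0) {
  console.error(`${record.benchmarkId} regressed: score=${record.score}`, blocking);
  process.exit(1);
}
```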
| Your problem | Start with |
|---|---|
| I generate UI from screenshots or design references | @structured-evals/ui-reconstruction |
| I route user messages to domain intents | @structured-evals/intent-vocabulary |
| I want to evaluate my own system output | Using your own system |
| I want to create my own task-specific eval | Building a task-specific eval pack |
| I want to calibrate an evaluator | Evaluator calibration |
| I want to see all included benchmarks | Benchmark index |
| I want copyable usage examples | Recipes |
```
messy AI task
  -> benchmark manifest
  -> fixtures + expected outputs
  -> task-specific graders
  -> failure modes + slices
  -> human-readable report + JSON run record
  -> adjudication + calibration
  -> refined evaluator
```

The core provides the task-agnostic runner, fixtures, grader contracts, weighted scoring, aggregation, benchmark manifests, adjudication helpers, and report primitives.
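To make "grader contracts" and "weighted scoring" concrete, here is a rough sketch of what a deterministic grader and a weighted aggregate could look like; the names and shapes are assumptions for illustration, not the core's exported API:

```ts
// Illustrative grader contract and weighted aggregation -- assumed shapes,
// not the types actually exported by the core.
interface GraderResult {
  criterionId: string;
  score: number;              // 0..1, deterministic given fixture + candidate
  failureModes: string[];     // named failure modes, e.g. "lost-hierarchy"
  repairHint?: string;
}

type Grader<Fixture, Candidate> = (fixture: Fixture, candidate: Candidate) => GraderResult;

function aggregate(results: GraderResult[], weights: Record<string, number>): number {
  const total = results.reduce((sum, r) => sum + (weights[r.criterionId] ?? 0), 0);
  const weighted = results.reduce(
    (sum, r) => sum + r.score * (weights[r.criterionId] ?? 0),
    0,
  );
  return total > 0 ? weighted / total : 0;
}
```

Keeping graders deterministic is what lets failure modes be named, sliced, and compared across runs.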
@structured-evals/ui-reconstruction evaluates screenshot-to-UI reconstruction fidelity using ScreenIR Lite. Pixel similarity is not enough; UI fidelity is structural.
Start with:
@structured-evals/intent-vocabulary evaluates domain-specific NLU signal coverage with accepted/proposed fixture tracks, actual signal exports, and an injected resolver interface.
Start with:
Structured Evals does not call your model. It evaluates structured outputs you provide.
- UI reconstruction: pass `--candidate-file` with a ScreenIR Lite JSON candidate (see the sketch after this list).
- Intent vocabulary: pass `--actual-signals` with exported signal outputs or `--resolver` with a local resolver module.
- CI/reporting: pass `--out` and `--json-out` to write Markdown and machine-readable run records.
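In practice that means your system writes a file the CLI can read. A hedged sketch, assuming a simplified ScreenIR Lite-like node shape (the real schema and field names may differ; `generateUi` is a placeholder for your own generation call):

```ts
// Export your system's output into the file passed to --candidate-file.
// The node shape below is a simplified assumption, not the real ScreenIR Lite
// schema; generateUi stands in for your own generation system.
import { writeFileSync } from "node:fs";

declare function generateUi(
  screenId: string,
): Promise<{ nodes: { role: string; label: string }[] }>;

async function exportCandidate(): Promise<void> {
  const screen = await generateUi("mobile-login");
  const candidate = {
    root: {
      role: "screen",
      children: screen.nodes.map((n) => ({ role: n.role, label: n.label, children: [] })),
    },
  };
  writeFileSync("./my-output.json", JSON.stringify(candidate, null, 2));
}

exportCandidate().catch((err) => {
  console.error(err);
  process.exit(1);
});
```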
Read Using your own system, then try the recipes.
The repo includes a calibrated benchmark example for ui.mobile-login.
It shows how to create candidate variants, adjudicate expected failures, compare evaluator output to reviewer judgment, inspect disagreements, and refine evaluator behavior.
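The calibration report is the source of truth here; as a rough illustration of the comparison it performs, an agreement rate between evaluator-detected and reviewer-adjudicated failure modes could be computed like this (types are assumptions, not the repo's):

```ts
// Per-failure-mode agreement between evaluator output and reviewer adjudication.
// Judgment is an assumed shape for the sketch, not a type from this repo.
interface Judgment {
  fixtureId: string;
  failureModes: Set<string>;
}

function agreementRate(evaluator: Judgment[], reviewers: Judgment[]): number {
  const byFixture = new Map(reviewers.map((r) => [r.fixtureId, r.failureModes] as const));
  let agree = 0;
  let total = 0;
  for (const e of evaluator) {
    const human = byFixture.get(e.fixtureId);
    if (!human) continue;
    // Count a failure mode as agreed only when both sides flagged it.
    for (const mode of new Set([...e.failureModes, ...human])) {
      total += 1;
      if (e.failureModes.has(mode) && human.has(mode)) agree += 1;
    }
  }
  return total > 0 ? agree / total : 1;
}
```

Fixtures where the two sides disagree are the ones worth inspecting before adjusting thresholds, severities, or failure relationships.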
Start here:
Generic scoring is useful when the task is generic. Most product AI tasks are not generic.
Read Why Structured Evals for the core pattern: reusable infrastructure belongs in the core; task truth belongs in eval packs.
For benchmark-specific authoring, read in this order:
- Building a task-specific eval pack
- Benchmark manifests
- Slices and criteria
- Adjudication
- CLI
- Templates
A useful pack should provide:
- fixture schema
- candidate or resolver interface
- benchmark manifest with criteria and slices
- graders or focused evaluator
- structured failure taxonomy (a minimal sketch follows this list)
- JSON and human-readable reporting
- eval card
- small committed fixtures
- tests proving meaningful failure modes
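For the structured failure taxonomy, one possible encoding is a flat map of named failure modes with severities and relationships; the identifiers and fields below are invented for illustration, not modes shipped with this repo:

```ts
// A possible failure taxonomy encoding -- identifiers, severities, and
// relationships here are invented examples, not taken from this repo.
const FAILURE_MODES = {
  "lost-hierarchy": {
    severity: "critical",
    description: "Container nesting flattened; parent/child intent lost",
    relatedTo: ["lost-interaction"],
  },
  "lost-interaction": {
    severity: "major",
    description: "Interactive element rendered as static content",
    relatedTo: [],
  },
  "token-drift": {
    severity: "minor",
    description: "Design token replaced with a hard-coded value",
    relatedTo: [],
  },
} as const;

type FailureModeId = keyof typeof FAILURE_MODES;
```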
Out of scope:
- no LLM-as-judge in v0.1
- no OCR
- no screenshot parsing
- no browser rendering
- no hosted dashboards
- no private product adapters
- no proprietary product screens or private data
Structured Evals is built around a few assumptions:
- evals are product infrastructure, not just tests
- generic runners should stay small
- task truth belongs in packs
- failure modes should be named and inspectable
- evaluators need calibration
- reports should drive improvement, not just produce a score