A TypeScript starter kit for building calibrated, task-specific eval packs for AI systems.
Generic evals hide the failure surface. Structured Evals helps teams turn domain-shaped AI failures into benchmark artifacts: fixtures, criteria, slices, graders, failure modes, adjudication, calibration, and reports.
It is designed for AI systems where one generic score is too blunt and failures need to be named, sliced, adjudicated, and improved over time.
Use this repo if you are building an AI system where correctness is structured, domain-specific, and hard to capture with one generic score.
Examples:
- A UI generation system that must preserve hierarchy, roles, interactions, and design intent.
- An intent router that must distinguish similar user requests without over-routing.
- An agent workflow where failures need to be classified by root cause, not just pass/fail.
- A product team that needs benchmark artifacts reviewers can inspect and improve over time.
This repo is probably not what you want if you only need:
- a generic LLM-as-judge wrapper
- a hosted dashboard
- a pixel-diff visual regression tool
- a model benchmark leaderboard
- a one-off test script
- Create an `eval.manifest.yaml` that defines the task, criteria, slices, coverage, and limitations (a rough sketch of a manifest's shape follows this list).
- Use deterministic graders to produce failure modes, repair hints, slice summaries, and reports.
- Bring your own ScreenIR Lite output, exported signal output, or resolver adapter.
- Compare evaluator output against human adjudication to measure agreement and refine thresholds, severities, and failure relationships.
- Copy the templates and define your own fixture schema, criteria, failure modes, reports, and calibration set.
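The manifest schema itself lives in this repo's templates; purely as an illustration, a manifest parsed into TypeScript might look roughly like this (every field name below is an assumption, not the shipped schema):

```ts
// Hypothetical shape of a parsed eval.manifest.yaml -- field names are
// illustrative assumptions, not the schema defined by this repo's templates.
interface BenchmarkManifest {
  id: string;                                   // e.g. "ui.mobile-login"
  task: string;                                 // what the system under test must do
  criteria: { id: string; weight: number; description: string }[];
  slices: { id: string; fixtureIds: string[] }[];
  coverage: string;                             // what the fixtures exercise
  limitations: string[];                        // known blind spots of the benchmark
}

const manifest: BenchmarkManifest = {
  id: "ui.mobile-login",
  task: "Reconstruct a mobile login screen from a design reference",
  criteria: [
    { id: "hierarchy", weight: 0.4, description: "Container nesting preserved" },
    { id: "interaction", weight: 0.3, description: "Tap targets keep their roles" },
    { id: "token-intent", weight: 0.3, description: "Design tokens map to intent" },
  ],
  slices: [{ id: "auth-forms", fixtureIds: ["mobile-login-basic"] }],
  coverage: "Single-screen login flows only",
  limitations: ["No multi-step auth", "No localization fixtures"],
};
```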
```bash
npm install
npm run benchmarks:list
npm run cli -- run ui-reconstruction ui.mobile-login --candidate bad
npm run cli -- calibrate ui.mobile-login
```

Then point your own output at an existing benchmark:

```bash
npm run cli -- generate-report ui-reconstruction ui.mobile-login \
  --candidate-file ./my-output.json \
  --candidate-label my-system-v1 \
  --out ./tmp/mobile-login.md \
  --json-out ./tmp/mobile-login.json
```

The bad UI report shows how the same reconstruction can preserve some visible structure while losing hierarchy, interaction affordances, and token intent. The calibration report shows how evaluator output compares against reviewer adjudication.
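If you feed `--json-out` into CI, a gate can be as small as reading the run record and failing the build on regressions. The fields read below are assumptions about the record's shape, not the actual schema; check the generated JSON for the real keys:

```ts
// Minimal CI gate over a --json-out run record. The fields read here
// (score, failures, severity) are assumed for the sketch, not guaranteed
// by the real run-record schema.
import { readFileSync } from "node:fs";

interface RunRecord {
  benchmarkId: string;
  score: number;                                   // assumed aggregate in [0, 1]
  failures: { mode: string; severity: string }[];
}

const record = JSON.parse(
  readFileSync("./tmp/mobile-login.json", "utf8"),
) as RunRecord;

const blocking = record.failures.filter((f) => f.severity === "critical");
if (record.score < 0.8 || blocking.length > 0) {
  console.error(`${record.benchmarkId} regressed: score=${record.score}`, blocking);
  process.exit(1);
}
```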
| Your problem | Start with |
|---|---|
| I generate UI from screenshots or design references | @structured-evals/ui-reconstruction |
| I route user messages to domain intents | @structured-evals/intent-vocabulary |
| I want to evaluate my own system output | Using your own system |
| I want to create my own task-specific eval | Building a task-specific eval pack |
| I want to calibrate an evaluator | Evaluator calibration |
| I want to see all included benchmarks | Benchmark index |
| I want copyable usage examples | Recipes |
```
messy AI task
  -> benchmark manifest
  -> fixtures + expected outputs
  -> task-specific graders
  -> failure modes + slices
  -> human-readable report + JSON run record
  -> adjudication + calibration
  -> refined evaluator
```

The core provides the task-agnostic runner, fixtures, grader contracts, weighted scoring, aggregation, benchmark manifests, adjudication helpers, and report primitives.
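To make "grader contracts" and "weighted scoring" concrete, here is a rough sketch of what a deterministic grader and a weighted aggregate could look like; the names and shapes are assumptions for illustration, not the core's exported API:

```ts
// Illustrative grader contract and weighted aggregation -- assumed shapes,
// not the types actually exported by the core.
interface GraderResult {
  criterionId: string;
  score: number;              // 0..1, deterministic given fixture + candidate
  failureModes: string[];     // named failure modes, e.g. "lost-hierarchy"
  repairHint?: string;
}

type Grader<Fixture, Candidate> = (fixture: Fixture, candidate: Candidate) => GraderResult;

function aggregate(results: GraderResult[], weights: Record<string, number>): number {
  const total = results.reduce((sum, r) => sum + (weights[r.criterionId] ?? 0), 0);
  const weighted = results.reduce(
    (sum, r) => sum + r.score * (weights[r.criterionId] ?? 0),
    0,
  );
  return total > 0 ? weighted / total : 0;
}
```

Keeping graders deterministic is what lets failure modes be named, sliced, and compared across runs.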
@structured-evals/ui-reconstruction evaluates screenshot-to-UI reconstruction fidelity using ScreenIR Lite. Pixel similarity is not enough; UI fidelity is structural.
Start with:
@structured-evals/intent-vocabulary evaluates domain-specific NLU signal coverage with accepted/proposed fixture tracks, actual signal exports, and an injected resolver interface.
Start with:
Structured Evals does not call your model. It evaluates structured outputs you provide.
- UI reconstruction: pass `--candidate-file` with a ScreenIR Lite JSON candidate (see the sketch after this list).
- Intent vocabulary: pass `--actual-signals` with exported signal outputs or `--resolver` with a local resolver module.
- CI/reporting: pass `--out` and `--json-out` to write Markdown and machine-readable run records.
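In practice that means your system writes a file the CLI can read. A hedged sketch, assuming a simplified ScreenIR Lite-like node shape (the real schema and field names may differ; `generateUi` is a placeholder for your own generation call):

```ts
// Export your system's output into the file passed to --candidate-file.
// The node shape below is a simplified assumption, not the real ScreenIR Lite
// schema; generateUi stands in for your own generation system.
import { writeFileSync } from "node:fs";

declare function generateUi(
  screenId: string,
): Promise<{ nodes: { role: string; label: string }[] }>;

async function exportCandidate(): Promise<void> {
  const screen = await generateUi("mobile-login");
  const candidate = {
    root: {
      role: "screen",
      children: screen.nodes.map((n) => ({ role: n.role, label: n.label, children: [] })),
    },
  };
  writeFileSync("./my-output.json", JSON.stringify(candidate, null, 2));
}

exportCandidate().catch((err) => {
  console.error(err);
  process.exit(1);
});
```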
Read Using your own system, then try the recipes.
The repo includes a calibrated benchmark example for ui.mobile-login.
It shows how to create candidate variants, adjudicate expected failures, compare evaluator output to reviewer judgment, inspect disagreements, and refine evaluator behavior.
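The calibration report is the source of truth here; as a rough illustration of the comparison it performs, an agreement rate between evaluator-detected and reviewer-adjudicated failure modes could be computed like this (types are assumptions, not the repo's):

```ts
// Per-failure-mode agreement between evaluator output and reviewer adjudication.
// Judgment is an assumed shape for the sketch, not a type from this repo.
interface Judgment {
  fixtureId: string;
  failureModes: Set<string>;
}

function agreementRate(evaluator: Judgment[], reviewers: Judgment[]): number {
  const byFixture = new Map(reviewers.map((r) => [r.fixtureId, r.failureModes] as const));
  let agree = 0;
  let total = 0;
  for (const e of evaluator) {
    const human = byFixture.get(e.fixtureId);
    if (!human) continue;
    // Count a failure mode as agreed only when both sides flagged it.
    for (const mode of new Set([...e.failureModes, ...human])) {
      total += 1;
      if (e.failureModes.has(mode) && human.has(mode)) agree += 1;
    }
  }
  return total > 0 ? agree / total : 1;
}
```

Fixtures where the two sides disagree are the ones worth inspecting before adjusting thresholds, severities, or failure relationships.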
Start here:
Generic scoring is useful when the task is generic. Most product AI tasks are not generic.
Read Why Structured Evals for the core pattern: reusable infrastructure belongs in the core; task truth belongs in eval packs.
For benchmark-specific authoring, read in this order:
- Building a task-specific eval pack
- Benchmark manifests
- Slices and criteria
- Adjudication
- CLI
- Templates
A useful pack should provide:
- fixture schema
- candidate or resolver interface
- benchmark manifest with criteria and slices
- graders or focused evaluator
- structured failure taxonomy (a minimal sketch follows this list)
- JSON and human-readable reporting
- eval card
- small committed fixtures
- tests proving meaningful failure modes
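For the structured failure taxonomy, one possible encoding is a flat map of named failure modes with severities and relationships; the identifiers and fields below are invented for illustration, not modes shipped with this repo:

```ts
// A possible failure taxonomy encoding -- identifiers, severities, and
// relationships here are invented examples, not taken from this repo.
const FAILURE_MODES = {
  "lost-hierarchy": {
    severity: "critical",
    description: "Container nesting flattened; parent/child intent lost",
    relatedTo: ["lost-interaction"],
  },
  "lost-interaction": {
    severity: "major",
    description: "Interactive element rendered as static content",
    relatedTo: [],
  },
  "token-drift": {
    severity: "minor",
    description: "Design token replaced with a hard-coded value",
    relatedTo: [],
  },
} as const;

type FailureModeId = keyof typeof FAILURE_MODES;
```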
Out of scope:
- no LLM-as-judge in v0.1
- no OCR
- no screenshot parsing
- no browser rendering
- no hosted dashboards
- no private product adapters
- no proprietary product screens or private data
Structured Evals is built around a few assumptions:
- evals are product infrastructure, not just tests
- generic runners should stay small
- task truth belongs in packs
- failure modes should be named and inspectable
- evaluators need calibration
- reports should drive improvement, not just produce a score