
Structured Evals

A TypeScript starter kit for building calibrated, task-specific eval packs for AI systems.

Generic evals hide the failure surface. Structured Evals helps teams turn domain-shaped AI failures into benchmark artifacts: fixtures, criteria, slices, graders, failure modes, adjudication, calibration, and reports.

It is designed for AI systems where one generic score is too blunt and failures need to be named, sliced, adjudicated, and improved over time.

Who This Is For

Use this repo if you are building an AI system where correctness is structured, domain-specific, and hard to capture with one generic score.

Examples:

  • A UI generation system that must preserve hierarchy, roles, interactions, and design intent.
  • An intent router that must distinguish similar user requests without over-routing.
  • An agent workflow where failures need to be classified by root cause, not just pass/fail.
  • A product team that needs benchmark artifacts reviewers can inspect and improve over time.

When Not To Use This

This repo is probably not what you want if you only need:

  • a generic LLM-as-judge wrapper
  • a hosted dashboard
  • a pixel-diff visual regression tool
  • a model benchmark leaderboard
  • a one-off test script

What You Can Do With This Repo

1. Define a benchmark for a messy AI task

Create an eval.manifest.yaml that defines the task, criteria, slices, coverage, and limitations.
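
For orientation, here is a hedged sketch of the kind of information such a manifest carries, expressed as a TypeScript shape. The field names are illustrative assumptions, not the exact schema defined by @structured-evals/core.

// Illustrative shape only; the real manifest is YAML and its schema is
// defined by @structured-evals/core.
interface BenchmarkManifestSketch {
  id: string;                                     // e.g. "ui.mobile-login"
  task: string;                                   // what the system under test must do
  criteria: { id: string; weight: number; description: string }[];
  slices: { id: string; fixtureIds: string[] }[]; // named subsets of fixtures
  coverage: string[];                             // what the benchmark exercises
  limitations: string[];                          // what it deliberately leaves out
}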

2. Run structured candidates through task-specific graders

Use deterministic graders to produce failure modes, repair hints, slice summaries, and reports.
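
As a hedged sketch of what a deterministic, task-specific grader can look like (the names below are assumptions, not the actual grader contract in @structured-evals/core):

// Illustrative grader shape; the real contract lives in @structured-evals/core.
interface GraderFinding {
  criterionId: string;
  failureMode: string;                      // named failure, e.g. "hierarchy.flattened"
  severity: "minor" | "major" | "blocking";
  repairHint?: string;                      // how the candidate could be repaired
}

interface Grader<Fixture, Candidate> {
  id: string;
  grade(fixture: Fixture, candidate: Candidate): GraderFinding[];
}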

3. Evaluate your own system outputs

Bring your own ScreenIR Lite output, exported signal output, or resolver adapter.

4. Calibrate the evaluator

Compare evaluator output against human adjudication to measure agreement and refine thresholds, severities, and failure relationships.
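
A minimal sketch of that comparison, assuming simple pass/fail verdicts per fixture and criterion (the record shapes are illustrative, not the repo's actual calibration format):

// Hypothetical shapes for comparing evaluator verdicts to human adjudication.
type Verdict = { fixtureId: string; criterionId: string; pass: boolean };

function agreementRate(evaluator: Verdict[], adjudicated: Verdict[]): number {
  const key = (v: Verdict) => `${v.fixtureId}:${v.criterionId}`;
  const human = new Map(adjudicated.map((v): [string, boolean] => [key(v), v.pass]));
  const shared = evaluator.filter((v) => human.has(key(v)));
  if (shared.length === 0) return 0;
  const agreed = shared.filter((v) => human.get(key(v)) === v.pass).length;
  return agreed / shared.length;  // 1.0 = evaluator matches reviewers everywhere
}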

5. Create a new eval pack

Copy the templates and define your own fixture schema, criteria, failure modes, reports, and calibration set.
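
For illustration only, a new pack's fixture schema might start from something like this; the real schema is whatever your task needs:

// Hypothetical fixture shape for a new pack; every field here is up to you.
interface ExampleFixture {
  id: string;                          // stable id referenced by slices
  input: { message: string };          // what the system under test receives
  expected: {
    intent: string;                    // ground-truth label
    mustNotRouteTo: string[];          // guards against over-routing
  };
  tags: string[];                      // used to group fixtures into slices
}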

Try It In 10 Minutes

npm install
npm run benchmarks:list
npm run cli -- run ui-reconstruction ui.mobile-login --candidate bad
npm run cli -- calibrate ui.mobile-login

Then point your own output at an existing benchmark:

npm run cli -- generate-report ui-reconstruction ui.mobile-login \
  --candidate-file ./my-output.json \
  --candidate-label my-system-v1 \
  --out ./tmp/mobile-login.md \
  --json-out ./tmp/mobile-login.json

The report for the bad candidate shows how a reconstruction can preserve some visible structure while still losing hierarchy, interaction affordances, and token intent. The calibration report shows how evaluator output compares with reviewer adjudication.

Start Here

Your problem → Start with

  • I generate UI from screenshots or design references → @structured-evals/ui-reconstruction
  • I route user messages to domain intents → @structured-evals/intent-vocabulary
  • I want to evaluate my own system output → Using your own system
  • I want to create my own task-specific eval → Building a task-specific eval pack
  • I want to calibrate an evaluator → Evaluator calibration
  • I want to see all included benchmarks → Benchmark index
  • I want copyable usage examples → Recipes

Workflow

messy AI task
  |
  v
benchmark manifest
  |
  v
fixtures + expected outputs
  |
  v
task-specific graders
  |
  v
failure modes + slices
  |
  v
human-readable report + JSON run record
  |
  v
adjudication + calibration
  |
  v
refined evaluator

Included Packs

@structured-evals/core

Task-agnostic runner, fixtures, grader contracts, weighted scoring, aggregation, benchmark manifests, adjudication helpers, and report primitives.
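
As an illustration of what weighted scoring over criteria can mean (a sketch, not the core package's actual implementation):

// Sketch: combine per-criterion scores (0..1) using manifest weights.
function weightedScore(
  scores: Record<string, number>,   // criterionId -> score in [0, 1]
  weights: Record<string, number>,  // criterionId -> weight
): number {
  let total = 0;
  let weightSum = 0;
  for (const [criterionId, weight] of Object.entries(weights)) {
    total += weight * (scores[criterionId] ?? 0);  // missing score counts as 0
    weightSum += weight;
  }
  return weightSum > 0 ? total / weightSum : 0;
}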

@structured-evals/ui-reconstruction

Evaluates screenshot-to-UI reconstruction fidelity using ScreenIR Lite. Pixel similarity is not enough; UI fidelity is structural.

Start with:

@structured-evals/intent-vocabulary

Evaluates domain-specific NLU signal coverage with accepted/proposed fixture tracks, actual signal exports, and an injected resolver interface.

Start with:

Use With Your Own System

Structured Evals does not call your model. It evaluates structured outputs you provide.

  • UI reconstruction: pass --candidate-file with a ScreenIR Lite JSON candidate.
  • Intent vocabulary: pass --actual-signals with exported signal outputs or --resolver with a local resolver module.
  • CI/reporting: pass --out and --json-out to write Markdown and machine-readable run records.

Read Using your own system, then try the recipes.
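
For the --resolver path in particular, a local resolver module might look roughly like this. The exported names and shapes here are assumptions for illustration; the actual interface is defined by @structured-evals/intent-vocabulary, and the localhost URL is a placeholder for your own system.

// Hypothetical local resolver module for --resolver; the real interface is
// defined by @structured-evals/intent-vocabulary and may differ.
export interface ResolvedSignal {
  intent: string;       // resolved domain intent id
  confidence?: number;  // optional score in [0, 1]
}

export async function resolve(utterance: string): Promise<ResolvedSignal[]> {
  // Call your own system here (local model, HTTP service, rules engine, ...).
  const response = await fetch("http://localhost:8080/resolve", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ utterance }),
  });
  const data = await response.json();
  return data.signals ?? [];
}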

Evaluator Calibration

The repo includes a calibrated benchmark example for ui.mobile-login.

It shows how to create candidate variants, adjudicate expected failures, compare evaluator output to reviewer judgment, inspect disagreements, and refine evaluator behavior.

Start here:

Why Structured Evals

Generic scoring is useful when the task is generic. Most product AI tasks are not generic.

Read Why Structured Evals for the core pattern: reusable infrastructure belongs in the core; task truth belongs in eval packs.

Building A Pack

For benchmark-specific authoring, read in this order:

  1. Building a task-specific eval pack
  2. Benchmark manifests
  3. Slices and criteria
  4. Adjudication
  5. CLI
  6. Templates

A useful pack should provide (a minimal sketch of the last item follows the list):

  • fixture schema
  • candidate or resolver interface
  • benchmark manifest with criteria and slices
  • graders or focused evaluator
  • structured failure taxonomy
  • JSON and human-readable reporting
  • eval card
  • small committed fixtures
  • tests proving meaningful failure modes
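
Here is a minimal sketch of that last item, a test that pins one named failure mode; the fixture and grader shapes are illustrative, not the core contracts.

// Minimal sketch of a pack-level test that pins one named failure mode.
// Fixture and grader shapes are illustrative assumptions, not the core contracts.
import assert from "node:assert";

const fixture = {
  id: "login.basic",
  expected: { roles: ["textbox", "textbox", "button"] },
};

const flattenedCandidate = {
  roles: ["textbox", "textbox"],  // button lost: a meaningful, named failure
};

function gradeRoles(fx: typeof fixture, candidate: { roles: string[] }) {
  const missing = fx.expected.roles.filter((r) => !candidate.roles.includes(r));
  return missing.map((role) => ({
    failureMode: "roles.missing",
    repairHint: `add an element with role "${role}"`,
  }));
}

const findings = gradeRoles(fixture, flattenedCandidate);
assert.ok(findings.some((f) => f.failureMode === "roles.missing"));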

What This Intentionally Does Not Do

  • no LLM-as-judge in v0.1
  • no OCR
  • no screenshot parsing
  • no browser rendering
  • no hosted dashboards
  • no private product adapters
  • no proprietary product screens or private data

Design Principles

Structured Evals is built around a few assumptions:

  • evals are product infrastructure, not just tests
  • generic runners should stay small
  • task truth belongs in packs
  • failure modes should be named and inspectable
  • evaluators need calibration
  • reports should drive improvement, not just produce a score
