Simplify prompt testing + add Opus correctness reviewer #11
Merged
mattgodbolt merged 3 commits into main, Feb 21, 2026
Conversation
Replace the complex evaluation framework (Claude-as-judge scoring, Flask web review UI, prompt advisor, 9 historical prompt versions) with a simple run-and-compare workflow:

- `prompt-test run` sends test cases to the API and saves outputs
- `prompt-test compare A B` shows side-by-side results for human review
- `prompt-test list` shows available test cases

Removed:

- Claude-based automated scoring (ClaudeReviewer, PromptAdvisor)
- Flask web review interface and templates
- Human review JSONL collection system
- Prompt publishing workflow with validation
- 11 historical prompt YAML files
- Flask and Jinja2 dependencies

Kept:

- All 21 curated test cases with real assembly
- CE API client and test case enrichment
- Core runner (simplified from 300 to 130 lines)
- CLI (simplified from 500 to 180 lines)

The automated scoring didn't catch real accuracy issues (e.g. the O(2^n) vs O(n) fibonacci error that Haiku makes). Human review of actual outputs is more effective for a teaching tool.

-5,691 lines, +301 lines.

🤖 Generated by LLM (Claude, via OpenClaw)
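The `compare` step is plain side-by-side pairing of two runs' outputs for a human to read. A minimal sketch of that idea, assuming a results-file shape of `{"results": [{"case_id": ..., "output": ...}]}` — the field names and file layout here are illustrative, not the actual `prompt-test` implementation:

```python
import json


def compare_results(path_a: str, path_b: str) -> list[dict]:
    """Pair up outputs from two result files by test-case id for
    side-by-side human review (hypothetical JSON shape)."""
    with open(path_a) as f:
        run_a = {r["case_id"]: r["output"] for r in json.load(f)["results"]}
    with open(path_b) as f:
        run_b = {r["case_id"]: r["output"] for r in json.load(f)["results"]}

    rows = []
    # Union of case ids so a case missing from one run is still shown.
    for case_id in sorted(set(run_a) | set(run_b)):
        rows.append({
            "case": case_id,
            "a": run_a.get(case_id, "<missing>"),
            "b": run_b.get(case_id, "<missing>"),
        })
    return rows
```

The point of the design is that no scoring happens here at all: the tool only lines the two outputs up and leaves the judgement to the reader.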
Adds a focused correctness checker that uses Claude Opus to verify factual claims in explanations. Instead of abstract scoring dimensions, it identifies specific errors and warnings:

- Instruction semantics (e.g., `lea` as address calc vs memory access)
- Complexity/performance claims (e.g., O(2^n) vs O(n))
- Optimisation level characterisation
- Register usage and calling conventions

Usage:

```
prompt-test run --review         # Run + review in one step
prompt-test review results.json  # Review existing results
```

Each issue is classified as error (would mislead a student) or warning (imprecise but not strictly wrong).

🤖 Generated by LLM (Claude, via OpenClaw)
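One way the error/warning classification could be represented on the consuming side — the `Issue` type, field names, and severity labels below are illustrative assumptions, not the PR's actual `CorrectnessReviewer` code:

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """A single factual problem found in one explanation."""
    severity: str  # "error" (would mislead a student) or "warning" (imprecise)
    claim: str     # the claim being questioned, e.g. "fib is O(n)"
    detail: str    # why the claim is wrong or imprecise


def summarise(issues: list[Issue]) -> dict[str, int]:
    """Count issues by severity, so a run can flag hard errors
    separately from mere imprecision."""
    counts = {"error": 0, "warning": 0}
    for issue in issues:
        counts[issue.severity] += 1
    return counts
```

Keeping the two severities distinct matters for a teaching tool: an "error" count above zero is a prompt regression, while warnings are candidates for wording tweaks.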
Two test cases were hitting the 1024 limit and getting cut off mid-explanation. 1536 gives enough headroom without encouraging verbosity.

🤖 Generated by LLM (Claude, via OpenClaw)
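The change itself is a one-line cap bump. A sketch of what the relevant `app/prompt.yaml` fragment might look like — field names around `max_tokens` are assumptions, only the value change is from the PR:

```yaml
# app/prompt.yaml (illustrative fragment)
# max_tokens is a cap, not a target: most responses stay around 400-600
# tokens, so raising it only affects the cases that were being truncated.
max_tokens: 1536  # was 1024
```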
Pull request overview
This PR dramatically simplifies the prompt testing framework by removing complex evaluation infrastructure (~5,700 lines deleted) and replacing it with a focused correctness reviewer using Claude Opus (~300 lines added).
Changes:
- Removes Claude-as-judge abstract scoring system (5 numeric dimensions), Flask web review UI with localStorage, human review JSONL collection, prompt advisor/improvement engine, and 11 historical prompt YAML files
- Simplifies test runner to just collect outputs without automatic scoring
- Adds CorrectnessReviewer that uses Opus to check for specific factual errors (instruction semantics, complexity claims, optimization characterization, register usage)
- Removes Flask and Jinja2 dependencies
- Increases production prompt max_tokens from 1024 to 1536 (50% increase)
Reviewed changes
Copilot reviewed 31 out of 32 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| prompt_testing/runner.py | Simplified from 350 lines to 158; removed automatic scoring, just runs tests and collects outputs |
| prompt_testing/reviewer.py | New 148-line correctness checker using Opus with structured issue reporting |
| prompt_testing/cli.py | Simplified from 811 to 278 lines; new commands: run --review, review, compare |
| prompt_testing/file_utils.py | Moved load_all_test_cases from scorer.py, no functional changes |
| pyproject.toml | Removed flask and jinja2 dependencies |
| uv.lock | Updated lock file reflecting dependency removals |
| app/prompt.yaml | Increased max_tokens from 1024 to 1536 |
| prompt_testing/evaluation/* | Deleted entire evaluation module (scorer, reviewer, claude_reviewer, prompt_advisor, templates) |
| prompt_testing/templates/* | Deleted Flask web UI templates |
| prompt_testing/prompts/* | Deleted 11 historical prompt versions |
| prompt_testing/*.md | Deleted documentation for removed features |
Summary
Two changes in one PR:
1. Simplify the testing framework (-5,691 lines, +301 lines)
Removes the complex evaluation system:

- Claude-based automated scoring (ClaudeReviewer, PromptAdvisor)
- Flask web review UI and templates
- Human review JSONL collection
- Prompt publishing workflow
- 11 historical prompt YAML files
- Flask and Jinja2 dependencies

Keeps:

- All 21 curated test cases with real assembly
- CE API client and test case enrichment
- Simplified runner and CLI
2. Add focused Opus correctness reviewer (+258 lines)
Instead of abstract scores, uses Claude Opus to check for specific factual errors:

- Instruction semantics (is `lea` correctly described?)
- Complexity/performance claims (O(2^n) vs O(n))
- Optimisation level characterisation
- Register usage and calling conventions

Each issue is classified as error (would mislead a student) or warning (imprecise but not wrong).
3. Bump `max_tokens` from 1024 to 1536
Two test cases (`edge_long_asm_001` and `loop_experienced_assembly`) were hitting the 1024 output token limit and getting truncated mid-explanation. 1536 gives enough headroom for complex assembly without encouraging verbosity. Cost impact is minimal since `max_tokens` is a cap; most responses use 400-600 tokens.

Why this approach
The old abstract scoring (accuracy: 0.67, relevance: 0.82...) didn't catch real errors like the O(2^n)→O(n) fibonacci mistake. The new reviewer asks specific questions about correctness and reports concrete issues.
Testing
(I'm Molty, an AI assistant acting on behalf of @mattgodbolt)