
Simplify prompt testing + add Opus correctness reviewer#11

Merged
mattgodbolt merged 3 commits into main from molty/simplify-testing
Feb 21, 2026

Conversation

@mattgodbolt-molty
Contributor

@mattgodbolt-molty mattgodbolt-molty commented Feb 21, 2026

Summary

Three changes in one PR:

1. Simplify the testing framework (-5,691 lines, +301 lines)

Removes the complex evaluation system:

  • Claude-as-judge abstract scoring (5 numeric dimensions)
  • Flask web review UI with localStorage persistence
  • Human review JSONL collection system
  • Prompt advisor / improvement suggestion engine
  • 11 historical prompt YAML files
  • Flask and Jinja2 dependencies

Keeps:

  • All 21 curated test cases with real assembly
  • CE API client and enrichment
  • Core runner + CLI (both dramatically simplified)

2. Add focused Opus correctness reviewer (+258 lines)

Instead of abstract scores, uses Claude Opus to check for specific factual errors:

  • Instruction semantics (is lea correctly described?)
  • Complexity claims (O(2^n) vs O(n)?)
  • Optimisation level characterisation
  • Register usage and calling conventions

Each issue is classified as error (would mislead a student) or warning (imprecise but not wrong).
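The error/warning split might be modelled roughly like this (a hypothetical sketch; the actual structures in reviewer.py may differ, and the names here are illustrative):

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    ERROR = "error"      # would mislead a student
    WARNING = "warning"  # imprecise but not strictly wrong


@dataclass
class Issue:
    severity: Severity
    claim: str   # the statement in the explanation being flagged
    detail: str  # why the reviewer considers it wrong or imprecise


def passed_correctness(issues: list[Issue]) -> bool:
    # A case passes the correctness check if no ERROR-level issue was found;
    # warnings alone do not fail it.
    return not any(i.severity is Severity.ERROR for i in issues)
```

Separating the two levels lets a run report "20/21 passed" on errors alone while still surfacing imprecisions for a human to skim.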

# Run + review in one step
uv run prompt-test run --review

# Review existing results
uv run prompt-test review results/file.json

3. Bump max_tokens from 1024 to 1536

Two test cases (edge_long_asm_001 and loop_experienced_assembly) were hitting the 1024 output token limit and getting truncated mid-explanation. 1536 gives enough headroom for complex assembly without encouraging verbosity. Cost impact is minimal since max_tokens is a cap — most responses use 400-600 tokens.
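Truncation is detectable from the API response itself: the Anthropic Messages API reports stop_reason == "max_tokens" when output hit the cap, and usage gives the actual token count. A minimal sketch of such a check (the helper name and thresholds are assumptions, not code from this PR):

```python
def check_response(stop_reason: str, output_tokens: int, cap: int = 1536) -> str:
    """Classify a response against the output token cap."""
    if stop_reason == "max_tokens":
        # Output was cut off mid-explanation at the cap.
        return "truncated"
    if output_tokens > cap * 0.9:
        # Finished naturally but with little headroom left.
        return "near-cap"
    return "ok"
```

Since most responses use 400-600 tokens, raising the cap to 1536 only changes behaviour for the handful of cases that previously came back "truncated".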

Why this approach

The old abstract scoring (accuracy: 0.67, relevance: 0.82...) didn't catch real errors like the O(2^n)→O(n) fibonacci mistake. The new reviewer asks specific questions about correctness and reports concrete issues.
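Those focused questions might be framed to Opus roughly like this (a hypothetical sketch; the actual CorrectnessReviewer prompt may differ):

```python
def build_review_prompt(asm: str, explanation: str) -> str:
    """Assemble a correctness-review prompt asking for concrete issues,
    not abstract scores. Illustrative only."""
    return (
        "You are checking an assembly explanation for factual errors.\n\n"
        f"Assembly:\n{asm}\n\n"
        f"Explanation:\n{explanation}\n\n"
        "Check specifically:\n"
        "- Are instruction semantics described correctly (e.g. lea)?\n"
        "- Are complexity claims right (e.g. O(2^n) vs O(n))?\n"
        "- Is the optimisation level characterised accurately?\n"
        "- Are register usage and calling conventions correct?\n\n"
        "Report each issue as 'error' (would mislead a student) or "
        "'warning' (imprecise but not strictly wrong)."
    )
```

Asking for a verdict on named, checkable claims gives the reviewer something to fail on, where a 0-1 "accuracy" score could average the same mistake away.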

Testing

  • 91 unit tests pass
  • Pre-commit clean
  • Full 21-case run with Opus review: 20/21 passed correctness check

(I'm Molty, an AI assistant acting on behalf of @mattgodbolt)

Replace the complex evaluation framework (Claude-as-judge scoring,
Flask web review UI, prompt advisor, 11 historical prompt versions)
with a simple run-and-compare workflow:

- `prompt-test run` sends test cases to the API and saves outputs
- `prompt-test compare A B` shows side-by-side results for human review
- `prompt-test list` shows available test cases

Removed:
- Claude-based automated scoring (ClaudeReviewer, PromptAdvisor)
- Flask web review interface and templates
- Human review JSONL collection system
- Prompt publishing workflow with validation
- 11 historical prompt YAML files
- Flask and Jinja2 dependencies

Kept:
- All 21 curated test cases with real assembly
- CE API client and test case enrichment
- Core runner (simplified from 300 to 130 lines)
- CLI (simplified from 500 to 180 lines)

The automated scoring didn't catch real accuracy issues (e.g. the
O(2^n) vs O(n) fibonacci error that Haiku makes). Human review of
actual outputs is more effective for a teaching tool.

-5,691 lines, +301 lines.

🤖 Generated by LLM (Claude, via OpenClaw)
Adds a focused correctness checker that uses Claude Opus to verify
factual claims in explanations. Instead of abstract scoring dimensions,
it identifies specific errors and warnings:

- Instruction semantics (e.g., lea as address calc vs memory access)
- Complexity/performance claims (e.g., O(2^n) vs O(n))
- Optimisation level characterisation
- Register usage and calling conventions

Usage:
  prompt-test run --review              # Run + review in one step
  prompt-test review results.json       # Review existing results

Each issue is classified as error (would mislead a student) or
warning (imprecise but not strictly wrong).

🤖 Generated by LLM (Claude, via OpenClaw)
@mattgodbolt-molty mattgodbolt-molty changed the title Simplify prompt testing: remove automated scoring Simplify prompt testing + add Opus correctness reviewer Feb 21, 2026
@mattgodbolt mattgodbolt requested a review from Copilot February 21, 2026 17:50
Two test cases were hitting the 1024 limit and getting cut off
mid-explanation. 1536 gives enough headroom without encouraging
verbosity.

🤖 Generated by LLM (Claude, via OpenClaw)

Copilot AI left a comment


Pull request overview

This PR dramatically simplifies the prompt testing framework by removing complex evaluation infrastructure (~5,700 lines deleted) and replacing it with a focused correctness reviewer using Claude Opus (~300 lines added).

Changes:

  • Removes Claude-as-judge abstract scoring system (5 numeric dimensions), Flask web review UI with localStorage, human review JSONL collection, prompt advisor/improvement engine, and 11 historical prompt YAML files
  • Simplifies test runner to just collect outputs without automatic scoring
  • Adds CorrectnessReviewer that uses Opus to check for specific factual errors (instruction semantics, complexity claims, optimization characterization, register usage)
  • Removes Flask and Jinja2 dependencies
  • Increases production prompt max_tokens from 1024 to 1536 (50% increase)

Reviewed changes

Copilot reviewed 31 out of 32 changed files in this pull request and generated 1 comment.

Summary per file:

  • prompt_testing/runner.py: Simplified from 350 lines to 158; removed automatic scoring, just runs tests and collects outputs
  • prompt_testing/reviewer.py: New 148-line correctness checker using Opus with structured issue reporting
  • prompt_testing/cli.py: Simplified from 811 to 278 lines; new commands: run --review, review, compare
  • prompt_testing/file_utils.py: Moved load_all_test_cases from scorer.py; no functional changes
  • pyproject.toml: Removed flask and jinja2 dependencies
  • uv.lock: Updated lock file reflecting dependency removals
  • app/prompt.yaml: Increased max_tokens from 1024 to 1536
  • prompt_testing/evaluation/*: Deleted entire evaluation module (scorer, reviewer, claude_reviewer, prompt_advisor, templates)
  • prompt_testing/templates/*: Deleted Flask web UI templates
  • prompt_testing/prompts/*: Deleted 11 historical prompt versions
  • prompt_testing/*.md: Deleted documentation for removed features


@mattgodbolt mattgodbolt merged commit 1473117 into main Feb 21, 2026
2 checks passed
@mattgodbolt mattgodbolt deleted the molty/simplify-testing branch February 21, 2026 17:56