Simplify prompt testing + add Opus correctness reviewer #11
Merged
mattgodbolt merged 3 commits into main, Feb 21, 2026
Conversation
Replace the complex evaluation framework (Claude-as-judge scoring, Flask web review UI, prompt advisor, 9 historical prompt versions) with a simple run-and-compare workflow:

- `prompt-test run` sends test cases to the API and saves outputs
- `prompt-test compare A B` shows side-by-side results for human review
- `prompt-test list` shows available test cases

Removed:

- Claude-based automated scoring (ClaudeReviewer, PromptAdvisor)
- Flask web review interface and templates
- Human review JSONL collection system
- Prompt publishing workflow with validation
- 11 historical prompt YAML files
- Flask and Jinja2 dependencies

Kept:

- All 21 curated test cases with real assembly
- CE API client and test case enrichment
- Core runner (simplified from 300 to 130 lines)
- CLI (simplified from 500 to 180 lines)

The automated scoring didn't catch real accuracy issues (e.g. the O(2^n) vs O(n) fibonacci error that Haiku makes). Human review of actual outputs is more effective for a teaching tool.

-5,691 lines, +301 lines.

🤖 Generated by LLM (Claude, via OpenClaw)
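The `compare` step is plain side-by-side pairing of two runs' outputs for a human to read. A minimal sketch of that idea, assuming a results-file shape of `{"results": [{"case_id": ..., "output": ...}]}` — the field names and file layout here are illustrative, not the actual `prompt-test` implementation:

```python
import json


def compare_results(path_a: str, path_b: str) -> list[dict]:
    """Pair up outputs from two result files by test-case id for
    side-by-side human review (hypothetical JSON shape)."""
    with open(path_a) as f:
        run_a = {r["case_id"]: r["output"] for r in json.load(f)["results"]}
    with open(path_b) as f:
        run_b = {r["case_id"]: r["output"] for r in json.load(f)["results"]}

    rows = []
    # Union of case ids so a case missing from one run is still shown.
    for case_id in sorted(set(run_a) | set(run_b)):
        rows.append({
            "case": case_id,
            "a": run_a.get(case_id, "<missing>"),
            "b": run_b.get(case_id, "<missing>"),
        })
    return rows
```

The point of the design is that no scoring happens here at all: the tool only lines the two outputs up and leaves the judgement to the reader.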
Adds a focused correctness checker that uses Claude Opus to verify factual claims in explanations. Instead of abstract scoring dimensions, it identifies specific errors and warnings:

- Instruction semantics (e.g., `lea` as address calc vs memory access)
- Complexity/performance claims (e.g., O(2^n) vs O(n))
- Optimisation level characterisation
- Register usage and calling conventions

Usage:

```
prompt-test run --review         # Run + review in one step
prompt-test review results.json  # Review existing results
```

Each issue is classified as error (would mislead a student) or warning (imprecise but not strictly wrong).

🤖 Generated by LLM (Claude, via OpenClaw)
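One way the error/warning classification could be represented on the consuming side — the `Issue` type, field names, and severity labels below are illustrative assumptions, not the PR's actual `CorrectnessReviewer` code:

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """A single factual problem found in one explanation."""
    severity: str  # "error" (would mislead a student) or "warning" (imprecise)
    claim: str     # the claim being questioned, e.g. "fib is O(n)"
    detail: str    # why the claim is wrong or imprecise


def summarise(issues: list[Issue]) -> dict[str, int]:
    """Count issues by severity, so a run can flag hard errors
    separately from mere imprecision."""
    counts = {"error": 0, "warning": 0}
    for issue in issues:
        counts[issue.severity] += 1
    return counts
```

Keeping the two severities distinct matters for a teaching tool: an "error" count above zero is a prompt regression, while warnings are candidates for wording tweaks.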
Two test cases were hitting the 1024 limit and getting cut off mid-explanation. 1536 gives enough headroom without encouraging verbosity.

🤖 Generated by LLM (Claude, via OpenClaw)
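The change itself is a one-line cap bump. A sketch of what the relevant `app/prompt.yaml` fragment might look like — field names around `max_tokens` are assumptions, only the value change is from the PR:

```yaml
# app/prompt.yaml (illustrative fragment)
# max_tokens is a cap, not a target: most responses stay around 400-600
# tokens, so raising it only affects the cases that were being truncated.
max_tokens: 1536  # was 1024
```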
Pull request overview
This PR dramatically simplifies the prompt testing framework by removing complex evaluation infrastructure (~5,700 lines deleted) and replacing it with a focused correctness reviewer using Claude Opus (~300 lines added).
Changes:
- Removes Claude-as-judge abstract scoring system (5 numeric dimensions), Flask web review UI with localStorage, human review JSONL collection, prompt advisor/improvement engine, and 11 historical prompt YAML files
- Simplifies test runner to just collect outputs without automatic scoring
- Adds CorrectnessReviewer that uses Opus to check for specific factual errors (instruction semantics, complexity claims, optimization characterization, register usage)
- Removes Flask and Jinja2 dependencies
- Increases production prompt max_tokens from 1024 to 1536 (50% increase)
Reviewed changes
Copilot reviewed 31 out of 32 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| prompt_testing/runner.py | Simplified from 350 lines to 158; removed automatic scoring, just runs tests and collects outputs |
| prompt_testing/reviewer.py | New 148-line correctness checker using Opus with structured issue reporting |
| prompt_testing/cli.py | Simplified from 811 to 278 lines; new commands: run --review, review, compare |
| prompt_testing/file_utils.py | Moved load_all_test_cases from scorer.py, no functional changes |
| pyproject.toml | Removed flask and jinja2 dependencies |
| uv.lock | Updated lock file reflecting dependency removals |
| app/prompt.yaml | Increased max_tokens from 1024 to 1536 |
| prompt_testing/evaluation/* | Deleted entire evaluation module (scorer, reviewer, claude_reviewer, prompt_advisor, templates) |
| prompt_testing/templates/* | Deleted Flask web UI templates |
| prompt_testing/prompts/* | Deleted 11 historical prompt versions |
| prompt_testing/*.md | Deleted documentation for removed features |
Summary
Two changes in one PR:
1. Simplify the testing framework (-5,691 lines, +301 lines)
Removes the complex evaluation system:

- Claude-based automated scoring (ClaudeReviewer, PromptAdvisor)
- Flask web review UI and templates
- Human review JSONL collection
- Prompt publishing workflow
- 11 historical prompt YAML files
- Flask and Jinja2 dependencies

Keeps:

- All 21 curated test cases with real assembly
- CE API client and test case enrichment
- Simplified runner and CLI
2. Add focused Opus correctness reviewer (+258 lines)
Instead of abstract scores, uses Claude Opus to check for specific factual errors:

- Instruction semantics (is `lea` correctly described?)
- Complexity/performance claims (O(2^n) vs O(n))
- Optimisation level characterisation
- Register usage and calling conventions

Each issue is classified as error (would mislead a student) or warning (imprecise but not wrong).
3. Bump `max_tokens` from 1024 to 1536
Two test cases (`edge_long_asm_001` and `loop_experienced_assembly`) were hitting the 1024 output token limit and getting truncated mid-explanation. 1536 gives enough headroom for complex assembly without encouraging verbosity. Cost impact is minimal since `max_tokens` is a cap; most responses use 400-600 tokens.

Why this approach
The old abstract scoring (accuracy: 0.67, relevance: 0.82...) didn't catch real errors like the O(2^n)→O(n) fibonacci mistake. The new reviewer asks specific questions about correctness and reports concrete issues.
Testing
(I'm Molty, an AI assistant acting on behalf of @mattgodbolt)