diff --git a/app/prompt.yaml b/app/prompt.yaml
index aee3fc7..0b8fe39 100644
--- a/app/prompt.yaml
+++ b/app/prompt.yaml
@@ -2,7 +2,7 @@ name: Production Sonnet 4.6
 description: Sonnet 4.6 with deduplicated, tighter prompts
 model:
   name: claude-sonnet-4-6
-  max_tokens: 1024
+  max_tokens: 1536
   temperature: 0.0
 audience_levels:
   beginner:
diff --git a/prompt_testing/AUDIENCE_GUIDE.md b/prompt_testing/AUDIENCE_GUIDE.md
deleted file mode 100644
index 6893bae..0000000
--- a/prompt_testing/AUDIENCE_GUIDE.md
+++ /dev/null
@@ -1,97 +0,0 @@
-# Audience and Explanation Type Support in Prompt Testing
-
-This guide describes the audience and explanation type capabilities in the prompt testing framework.
-
-## Overview
-
-The prompt testing system supports testing prompts across different audiences and explanation types, matching the capabilities in the main explain API.
-
-### Audience Levels
-- **beginner**: Simple language, technical terms defined, step-by-step explanations
-- **experienced**: Assumes assembly knowledge, focuses on compiler behavior, optimizations, and architectural details
-
-### Explanation Types
-- **assembly**: Focus on assembly instructions and their purpose
-
-## Test Case Format
-
-Test cases can now specify audience and explanation type:
-
-```yaml
-cases:
-  - id: example_beginner_assembly
-    audience: beginner           # Optional, defaults to "beginner"
-    explanation_type: assembly   # Optional, defaults to "assembly"
-    expected_topics_by_audience: # Optional audience-specific expectations
-      beginner: [basic_concepts, simple_terms]
-      experienced: [compiler_behavior, register_usage, microarchitecture, advanced_optimizations]
-    # ... rest of test case
-```
-
-## Running Tests
-
-### Filter by audience:
-```bash
-uv run prompt-test run --prompt current --audience beginner
-```
-
-### Filter by explanation type:
-```bash
-uv run prompt-test run --prompt current --explanation-type optimization
-```
-
-### Combine filters:
-```bash
-uv run prompt-test run --prompt current --audience experienced --explanation-type optimization
-```
-
-## Scoring Adjustments
-
-The automatic scorer now adjusts expectations based on audience:
-
-1. **Clarity scoring**:
-   - Beginners: Shorter sentences, fewer technical terms, more explanatory language
-   - Experienced: Can handle longer sentences and more technical terminology
-
-2. **Length scoring**:
-   - Different target lengths for each audience
-   - Adjusted by explanation type (source mapping needs more space)
-
-3. **Topic coverage**:
-   - Uses audience-specific expected topics when available
-
-## Prompt Templates
-
-The prompt templates now include audience and explanation variables:
-
-```yaml
-system_prompt: |
-  You are an expert in {arch} assembly code and {language}...
-
-  Target audience: {audience}
-  {audience_guidance}
-
-  Explanation type: {explanation_type}
-  {explanation_focus}
-
-  # ... rest of prompt
-
-user_prompt: "Explain the {arch} {explanation_type_phrase}."
-```
-
-## Migration Notes
-
-- Existing test cases without audience/explanation fields default to "beginner" and "assembly"
-- The v1_baseline.yaml prompt remains unchanged for comparison purposes
-
-## Example Test Cases
-
-See `test_cases/audience_variations.yaml` for examples demonstrating different audience and explanation type combinations.
-
-## Simplified Audience System
-
-The system now uses just two audience levels:
-- **beginner**: For users new to assembly language
-- **experienced**: For users with assembly and compiler knowledge
-
-This simplified approach allows focus on perfecting explanations for these two clear segments while maintaining the technical infrastructure for future expansion.
diff --git a/prompt_testing/README.md b/prompt_testing/README.md
index 6ca9042..2e8e75c 100644
--- a/prompt_testing/README.md
+++ b/prompt_testing/README.md
@@ -1,491 +1,92 @@
-# Prompt Testing Framework
+# Prompt Testing
 
-A comprehensive framework for testing and iterating on prompts for the Claude explain service.
-
-## Overview
-
-This framework allows you to:
-- Test prompts against a curated set of assembly/source code examples
-- Compare different prompt versions automatically
-- Score responses using Claude's AI evaluation and human review
-- Track performance over time and identify regressions
+Simple framework for testing prompt changes against curated test cases.
 
 ## Quick Start
 
 ```bash
-# List available test cases and prompts
-uv run prompt-test list
+# Run all test cases with the current production prompt
+uv run prompt-test run
 
-# Run all test cases with current prompt
-uv run prompt-test run --prompt current
+# Run with Opus correctness review (catches factual errors)
+uv run prompt-test run --review
 
-# Run specific category of tests
-uv run prompt-test run --prompt current --categories basic_optimizations
+# Run specific cases or categories
+uv run prompt-test run --cases basic_loop_001 --cases basic_inline_001
+uv run prompt-test run --categories loop_optimization
 
-# Compare two prompt versions
-uv run prompt-test run --prompt current --compare v1_baseline
+# Review existing results with Opus
+uv run prompt-test review results/20250221_120000_current.json
 
-# Analyze all previous results
-uv run prompt-test analyze
+# Compare two result files
+uv run prompt-test compare results_a.json results_b.json
 
-# Get AI-powered prompt improvement suggestions
-uv run prompt-test improve --prompt current
+# List available test cases
+uv run prompt-test list
 ```
 
-## Directory Structure
+## How It Works
 
-```
-prompt_testing/
-├── test_cases/           # Test case definitions (YAML files)
-│   ├── basic_optimizations.yaml
-│   ├── complex_transformations.yaml
-│   └── edge_cases.yaml
-├── prompts/              # Prompt versions (YAML files)
-│   ├── v1_baseline.yaml
-│   └── v2_improved.yaml
-│   # Note: 'current' prompt is loaded from app/prompt.yaml
-├── results/              # Test results and analysis
-│   └── [timestamp]_[prompt_version].json
-└── evaluation/           # Scoring and review tools
-    ├── scorer.py        # Test case loading utilities
-    ├── claude_reviewer.py # Claude-based AI scoring
-    ├── prompt_advisor.py # Prompt improvement suggestions
-    ├── reviewer.py       # Human review tools
-    └── review_templates.yaml # Customizable evaluation criteria
-```
+1. **Test cases** live in `test_cases/*.yaml` — each has source code, compiler flags, and real assembly output
+2. `prompt-test run` sends each case to the Claude API using the current prompt and saves all outputs
+3. `--review` flag runs each output through Opus for **correctness checking** — it identifies specific factual errors rather than giving abstract scores
+4. You read the outputs (and any flagged issues) and decide if they're good
+5. To compare prompt changes: run once before, once after, then `prompt-test compare`
 
-## Test Case Format
+### Correctness Review
 
-Test cases are defined in YAML files with the following structure:
+The `--review` flag uses Claude Opus to check explanations for factual errors. Unlike generic scoring, it looks for specific issues:
 
-```yaml
-description: "Test case category description"
+- **Instruction semantics**: Is `lea` correctly described as address computation, not memory access?
+- **Complexity claims**: Does it claim O(n) when it's actually O(2^n)?
+- **Optimisation characterisation**: Does it correctly identify unoptimised code?
+- **Register usage**: Are calling conventions right?
+
+Each issue is flagged as an **error** (would mislead a student) or **warning** (imprecise but not wrong).
+
+## Test Case Format
 
+```yaml
 cases:
-  - id: unique_case_id
-    category: optimization_type
-    quality: good_example  # good_example|bad_example|challenging_example
-    description: "Human readable description"
+  - id: unique_id
+    category: loop_optimization
+    description: "What this tests"
+    audience: beginner          # or experienced
+    explanation_type: assembly  # or haiku
     input:
       language: C++
-      compiler: "gcc 13.1"
+      compiler: "x86-64 gcc 13.1"
       compilationOptions: ["-O2"]
       instructionSet: x86_64
       code: |
-        source code here
+        int foo() { return 42; }
       asm:
-        - text: "mov eax, edi"
-          address: 1
-          source:
-            line: 2
-    expected_topics: [vectorization, loop_optimization]
-    difficulty: beginner  # beginner|experienced
-```
-
-## Prompt Format
-
-Prompts are defined in YAML files that must match the production schema. Here's the required structure:
-
-```yaml
-# Model configuration (required)
-model:
-  name: claude-3-5-haiku-20241022  # Model to use
-  max_tokens: 1024                  # Maximum response tokens
-  temperature: 0.0                  # Sampling temperature
-
-# Audience levels (required)
-audience_levels:
-  beginner:
-    description: "Novice programmers"
-    guidance: "Use simple language, avoid jargon..."
-  experienced:
-    description: "Experienced developers and systems programmers"
-    guidance: "Assume familiarity with programming concepts, use technical terminology appropriately..."
-
-# Explanation types (required)
-explanation_types:
-  assembly:
-    description: "Focus on assembly instructions"
-    focus_areas:
-      - "Instruction-by-instruction breakdown"
-      - "Register usage and memory access"
-    user_prompt_phrase: "assembly output"
-  optimization:
-    description: "Focus on compiler optimizations"
-    focus_areas:
-      - "Optimization techniques applied"
-      - "Performance implications"
-    user_prompt_phrase: "compiler optimizations"
-
-# Prompt templates (required)
-system_prompt: |
-  You are an expert in {arch} assembly code and {language}...
-
-  Audience: {audience_description}
-  {audience_guidance}
-
-  Explanation Type: {explanation_type_description}
-  Focus on: {explanation_type_focus}
-
-user_prompt: "Explain the {arch} {explanation_type_phrase} for this {language} code."
-
-assistant_prefill: "I'll analyze the {explanation_type_phrase} and explain it for {audience_level} level."
-```
-
-**Important**: Test prompts must include ALL these fields to work in production. The `current` special prompt loads from `app/prompt.yaml` which contains the complete structure.
-
-## Evaluation Metrics
-
-The framework uses Claude-based AI scoring to provide comprehensive evaluation of prompt responses.
-
-### How Claude Scoring Works
-
-Claude evaluates each response on five key dimensions, as defined in `review_templates.yaml`:
-
-1. **Technical Accuracy** (30% weight):
-   - Are assembly instructions correctly explained?
-   - Are compiler optimizations accurately described?
-   - Are technical claims verifiable and correct?
-   - Does it avoid oversimplifications that lead to inaccuracy?
-   - Are register names, instruction mnemonics, and calling conventions correct?
-
-2. **Educational Value** (25% weight):
-   - Is the explanation at an appropriate level for the target audience?
-   - Does it build understanding progressively?
-   - Are complex concepts explained clearly?
-   - Does it provide insight into why the compiler made certain choices?
-   - Would a reader gain actionable knowledge?
-
-3. **Clarity & Structure** (20% weight):
-   - Is the explanation well-organized and easy to follow?
-   - Are technical terms properly introduced before use?
-   - Is the language clear and concise?
-   - Does it avoid unnecessary jargon while maintaining precision?
-   - Is there a logical flow from simple to complex concepts?
-
-4. **Completeness** (15% weight):
-   - Does it address all significant transformations in the assembly?
-   - Are important optimizations explained?
-   - Does it cover the key differences between source and assembly?
-   - Is the scope appropriate (not too narrow or too broad)?
-   - Are edge cases or special behaviors noted where relevant?
-
-5. **Practical Insights** (10% weight):
-   - Does it help developers understand performance implications?
-   - Are there actionable insights about writing better code?
-   - Does it explain when/why certain optimizations occur?
-   - Does it connect assembly behavior to source code patterns?
-
-### Benefits of Claude Scoring
-
-- **Context-Aware**: Understands relationships between concepts and code
-- **Nuanced Evaluation**: Catches subtle technical errors that pattern matching would miss
-- **Educational Assessment**: Evaluates pedagogical effectiveness, not just correctness
-- **Detailed Feedback**: Provides specific strengths, weaknesses, and improvement suggestions
-- **Consistent Standards**: Uses the same high-quality evaluation criteria for all test cases
-
-### Configuring the Claude Reviewer
-
-```bash
-# Use the default Claude Sonnet model
-uv run prompt-test run --prompt current
-
-# Use a different Claude model for review
-uv run prompt-test run --prompt current --reviewer-model claude-3-5-sonnet-20241022
+        - text: "foo():"
+        - text: "        mov     eax, 42"
+          source: { line: 1 }
+        - text: "        ret"
+          source: { line: 1 }
 ```
 
-## Example Output
+## Enriching Test Cases
 
-### Test Run Output
-```
-Running 3 test cases with prompt version: current
-  ✓ basic_loop_001: 0.85
-  ✓ basic_inline_001: 0.92
-  ✓ complex_vectorization_001: 0.78
-
-Summary for current:
-  Success rate: 100.0%
-  Cases: 3/3
-  Average score: 0.85
-  Average accuracy: 0.87
-  Average clarity: 0.83
-  Average tokens: 420
-  Average response time: 2845ms
-
-Detailed results saved to: prompt_testing/results/20241201_120000_current.json
-```
-
-### Detailed Feedback Example
-When examining the results file, Claude's evaluation includes:
-- Missing topics: ["SIMD instruction specifics", "Memory alignment considerations"]
-- Incorrect claims: ["The compiler always unrolls this loop"]
-- Strengths: ["Clear explanation of function inlining", "Good use of beginner-friendly language"]
-- Weaknesses: ["Could explain register allocation choices", "Missing performance implications"]
-- Overall assessment: "Strong foundational explanation but lacks some advanced optimization details"
-
-### Improvement Suggestions Output
-```
-=== PROMPT IMPROVEMENT SUGGESTIONS ===
-
-Average Score: 0.87
-
-🎯 Priority Improvements:
-  Issue: Incorrect claims about optimization techniques
-  Current: 'Be precise and accurate about CPU features...'
-  Suggested: 'Be precise and accurate about all compiler optimizations...'
-  Rationale: Expanded guidance prevents misidentifying optimizations
-
-✨ Expected Impact: Score should improve from 0.87 to 0.90+
-```
-
-## Usage Examples
-
-### Running Tests
+If you have test cases without assembly, the `enrich` command fetches real output from the Compiler Explorer API:
 
 ```bash
-# Test current prompt on all cases
-uv run prompt-test run --prompt current
-
-# Test specific cases
-uv run prompt-test run --prompt current --cases basic_loop_001 edge_empty_001
-
-# Test by category
-uv run prompt-test run --prompt current --categories loop_optimization vectorization
-
-# Compare two versions
-uv run prompt-test run --prompt v1_baseline --compare current --categories basic_optimizations
+uv run prompt-test enrich -i test_cases/new_tests.yaml
 ```
 
-### Managing Prompts
-
-1. Create a new prompt version:
-   ```bash
-   cp app/prompt.yaml prompt_testing/prompts/v3_experiment.yaml
-   # Edit the new prompt
-   ```
-
-2. Test the new prompt:
-   ```bash
-   uv run prompt-test run --prompt v3_experiment --compare current
-   ```
-
-3. Validate the new prompt structure:
-   ```bash
-   # Ensure the prompt loads correctly in production code
-   uv run python -c "from app.prompt import Prompt; Prompt.from_yaml('prompt_testing/prompts/v3_experiment.yaml')"
-   ```
-
-4. If it performs better AND validates, publish to production:
-   ```bash
-   # Automated deployment with validation and safety checks
-   uv run prompt-test publish --prompt v3_experiment --name "Production v4"
-
-   # Manual deployment (alternative, but automated is recommended)
-   # cp prompt_testing/prompts/v3_experiment.yaml app/prompt.yaml
-   # uv run pytest app/test_explain.py::test_process_request_success
-   ```
-
-### Adding Test Cases
-
-1. Add new cases to existing YAML files or create new category files
-2. Include realistic assembly output (you can use Compiler Explorer to generate examples)
-3. Specify expected topics that a good explanation should cover
-4. Test your new cases to ensure they work as expected
-
-### Prompt Improvement Workflow
-
-1. Run tests with Claude scoring to get detailed feedback:
-   ```bash
-   uv run prompt-test run --prompt current --scorer claude
-   ```
-
-2. Analyze results and get improvement suggestions:
-   ```bash
-   uv run prompt-test improve --prompt current
-   ```
-
-3. Create an experimental improved version:
-   ```bash
-   uv run prompt-test improve --prompt current --create-improved --output current_v2
-   ```
-
-4. Test the improved version:
-   ```bash
-   uv run prompt-test run --prompt current_v2 --scorer claude --compare current
-   ```
-
-5. Review deployment criteria:
-   - **Score improvement**: New prompt should score higher (e.g., 0.85+ average)
-   - **No regressions**: No individual test case should drop significantly
-   - **Cost efficiency**: Token usage should not increase dramatically
-   - **Error-free**: All test cases should complete without errors
-
-6. If ALL criteria are met, deploy to production:
-   ```bash
-   # Validate prompt structure
-   uv run python -c "from app.prompt import Prompt; Prompt.from_yaml('prompt_testing/prompts/current_v2.yaml')"
-
-   # Deploy (git tracks previous versions)
-   cp prompt_testing/prompts/current_v2.yaml app/prompt.yaml
-
-   # Test production integration
-   uv run pytest app/test_explain.py
-   ```
-
-### Human Review Workflow
-
-1. Run tests to generate results:
-   ```bash
-   uv run prompt-test run --prompt current --output my_test_results.json
-   ```
-
-2. Review results interactively via web interface:
-   ```bash
-   uv run prompt-test review --results-file prompt_testing/results/my_test_results.json
-   ```
-
-   **Features:**
-   - Visual status indicators (✅ reviewed, ⚪ pending) with colored borders
-   - Progress tracking with animated completion bar
-   - Side-by-side source code and assembly display
-   - Form pre-population with existing review data
-   - Update functionality for modifying reviews
-   - localStorage persistence for reviewer information
-   - 1-5 scale metrics aligned with human evaluation standards
-   - Line-separated input format for natural feedback entry
-
-3. Analyze review data:
-   ```bash
-   uv run prompt-test analyze
-   ```
-
-### Production Deployment Workflow
-
-Once you've tested and validated a prompt, publish it to production:
-
-```bash
-# Automated deployment with full validation and safety checks
-uv run prompt-test publish --prompt v7 --name "Production v8"
-
-# Or let it auto-generate a production name
-uv run prompt-test publish --prompt v7
-```
-
-**The publish command automatically:**
-- ✅ **Cleans metadata**: Removes experiment_metadata and cleans up names/descriptions
-- ✅ **Validates structure**: Ensures prompt loads correctly in main service
-- ✅ **Tests compatibility**: Verifies message generation works end-to-end
-- ✅ **Creates backup**: Automatically backs up existing production prompt
-- ✅ **Runs integration tests**: Executes full test suite to catch regressions
-- ✅ **Provides guidance**: Clear next steps for local testing and committing
-
-**Safety features:**
-- Temp file handling with automatic cleanup on errors
-- Error rollback if validation fails
-- Comprehensive error reporting for debugging
-
-## Best Practices
-
-### Test Case Creation
-
-- **Use real examples**: Generate assembly from actual code using Compiler Explorer
-- **Cover edge cases**: Include empty functions, undefined behavior, truncated assembly
-- **Vary difficulty**: Mix beginner and experienced examples
-- **Quality labels**: Mark cases as good/bad examples to test robustness
-
-### Prompt Development
-
-- **Start small**: Test on a subset before running full suite
-- **Iterate quickly**: Make small changes and measure impact
-- **Use version control**: Keep track of what works and what doesn't
-- **Document changes**: Note the reasoning behind prompt modifications
-
-### Evaluation Strategy
-
-- **Combine metrics**: Use both automatic and human evaluation
-- **Regular testing**: Run tests after any prompt changes
-- **Baseline comparison**: Always compare against previous best version
-- **Long-term tracking**: Monitor performance trends over time
-
-## Integration with Main Service
-
-The testing framework uses the same core logic as the main explain service:
-- Same request/response format (`ExplainRequest`/`ExplainResponse`)
-- Same data preparation (`prepare_structured_data`)
-- Same Claude API calls
-- Same token counting and cost calculation
-
-This ensures that test results accurately reflect production performance.
-
-## Troubleshooting
-
-### Common Issues
-
-1. **API Key Issues**: Ensure `ANTHROPIC_API_KEY` environment variable is set
-2. **Missing Test Cases**: Run `uv run prompt-test list` to see available cases
-3. **Import Errors**: Make sure you're running from the project root directory
-4. **Permission Errors**: Check that results directory is writable
-
-### Performance Considerations
-
-- **Rate Limits**: The framework respects Anthropic API rate limits
-- **Parallel Testing**: Currently runs sequentially (could be parallelized)
-- **Token Usage**: Monitor costs when running large test suites
-
-### Error Handling
-
-The framework uses **fail-fast error propagation**:
-- No silent failures or fallbacks that hide issues
-- Full stack traces for debugging
-- Errors immediately bubble up rather than being caught and logged
-- This ensures you always know when something goes wrong
-
-Example: If Claude API fails during scoring, the entire test run stops with a clear error.
-
-For more help, check the CLI help:
-```bash
-uv run prompt-test --help
-uv run prompt-test run --help
-```
-
-## Constitutional AI Approach
-
-The Claude-based scoring implements a "constitutional AI" approach where:
-
-1. **Fast Model Generates**: Claude Haiku or Sonnet generates explanations quickly
-2. **Advanced Model Reviews**: Claude Sonnet 4.0 (default) reviews outputs with deep analysis
-3. **Feedback Loop**: Review scores guide prompt improvements
-4. **Self-Improving**: The system learns what makes good explanations
-
-This approach enables:
-- **Scalable Quality**: Fast generation with quality assurance
-- **Objective Metrics**: AI-based evaluation reduces human bias
-- **Continuous Improvement**: Data-driven prompt optimization
-- **Cost Efficiency**: Only use expensive models for evaluation
-
-### Model Configuration
-
-Default models:
-- **Generation**: claude-3-5-haiku-20241022 (from main service)
-- **Review**: claude-sonnet-4-0 (for evaluation)
-- **Improvement**: claude-sonnet-4-0 (for suggestions)
+## Directory Structure
 
-You can override the review model:
-```bash
-uv run prompt-test run --prompt current --scorer claude --reviewer-model claude-3-opus-20240229
 ```
-
-### Customizing Review Criteria
-
-You can customize how Claude evaluates responses by editing `evaluation/review_templates.yaml`:
-
-```yaml
-custom_review:
-  system_prompt: "Your custom reviewer instructions..."
-  evaluation_dimensions:
-    your_dimension:
-      weight: 0.30
-      description: "What to evaluate..."
+prompt_testing/
+├── test_cases/       # Curated test cases (YAML)
+├── results/          # Saved test run outputs (JSON, gitignored)
+├── ce_api/           # Compiler Explorer API client
+├── runner.py         # Test runner
+├── reviewer.py       # Opus correctness checker
+├── cli.py            # CLI commands
+├── enricher.py       # CE API enrichment
+├── file_utils.py     # File I/O helpers
+└── yaml_utils.py     # YAML helpers
 ```
-
-This allows domain-specific evaluation criteria without code changes.
diff --git a/prompt_testing/WHATS_NEXT.md b/prompt_testing/WHATS_NEXT.md
deleted file mode 100644
index 8414e79..0000000
--- a/prompt_testing/WHATS_NEXT.md
+++ /dev/null
@@ -1,154 +0,0 @@
-# What's Next - Prompt Testing Framework Improvements
-
-## Overview
-
-This document outlines the next steps for improving the prompt testing framework based on audit findings and recent development work.
-
-## Recently Completed ✅
-
-### Web Review Interface (Latest)
-- **Fixed HTML review interface** - Replaced string concatenation with Flask + Jinja2
-- **Added markdown rendering** - AI responses now display with proper formatting using client-side marked.js
-- **Fixed template errors** - Resolved "dict has no attribute request" by enriching results with test case data
-- **Improved result descriptions** - Clear labels like "Current Production Prompt - 12 cases" instead of "unknown"
-- **Added CSS styling** - Proper code block, header, and list formatting
-- **Interactive web server** - `uv run prompt-test review --interactive` launches Flask app on localhost:5001
-- **COMPLETED: Quality of Life Improvements** ✅
-  * Phase 1: localStorage reviewer persistence + 1-5 metrics scale alignment
-  * Phase 2: Side-by-side source/assembly code display with responsive grid
-  * Phase 3: Line-separated input format (more natural than comma-separated)
-  * Phase 4: Review status indicators + progress tracking + update functionality
-  * Professional review management system with visual status, form pre-population, and real-time updates
-
-### Automated Prompt Publishing System (Latest)
-- **✅ COMPLETED: Production deployment automation** - `uv run prompt-test publish --prompt <version>`
-- **Metadata cleanup** - Automatically removes experiment_metadata and cleans names/descriptions for production
-- **Built-in validation** - Ensures prompt loads correctly in main service and can generate messages
-- **Safety features** - Automatic backup, temp file handling, error rollback, integration test execution
-- **Professional workflow** - Clear next steps guidance and comprehensive error reporting
-
-### Prompt Improvement System Audit & Fixes
-- **Fixed critical "current" prompt loading bug** - PromptOptimizer now handles "current" → `app/prompt.yaml` mapping
-- **Verified PromptAdvisor functionality** - Claude-based analysis with structured JSON suggestions working
-- **Tested improvement workflow** - `uv run prompt-test improve --prompt current` now works correctly
-- **Found existing analysis files** - Comprehensive suggestions in `/results/analysis_*` files with specific improvements
-
-### Earlier Improvements
-- Added support for calling CE REST API to fetch assembly output
-- Test cases can now have empty `asm` blocks that get populated automatically
-- Created `uv run prompt-test enrich` command to fetch real assembly data
-- Support for different compiler versions and optimization flags
-- Added error handling for JSON parsing in `claude_reviewer.py`
-- Added tests for `scorer.py` (test case loading)
-- Added S3-based caching for explanation responses
-- Migrated to Claude-only scoring (removed AutomaticScorer and HybridScorer)
-
-## Immediate Priority Actions
-
-### 1. **Integrate Human Review Data into Improvement Workflow**
-**Priority**: High - Critical gap in feedback loop
-
-**Current State**:
-- Web interface collects human reviews in JSONL format via ReviewManager
-- PromptAdvisor only uses automated Claude reviewer metrics
-- No integration between human feedback and improvement suggestions
-
-**Actions Needed**:
-1. Modify `PromptAdvisor.analyze_results_and_suggest_improvements()` to accept human review data
-2. Add human review aggregation alongside automated metrics in analysis prompt
-3. Update CLI improve command to load and pass human reviews to advisor
-4. Create unified feedback format combining human + automated reviews
-
-### 2. **Add File Selection Intelligence for Improve Command**
-**Priority**: Medium - Usability improvement
-
-**Current Issue**:
-- `uv run prompt-test improve --prompt current` uses "most recent results" which could be comparison/analysis files
-- No filtering by prompt version when multiple results exist
-
-**Actions Needed**:
-1. Add logic to skip analysis files (containing "analysis_" or "comparison_")
-2. Filter results by prompt version to avoid cross-contamination
-3. Prefer newer timestamp files when multiple valid options exist
-4. Add `--results-file` flag for manual override when needed
-
-### 3. **Add Iteration Tracking and Version History**
-**Priority**: Medium - Proper development workflow
-
-**Current Gap**: No tracking of prompt improvement lineage or performance over time
-
-**Actions Needed**:
-1. Add version history metadata to prompt YAML files
-2. Track what each version builds on (parent version, improvements applied)
-3. Performance tracking across iterations (score trends, regression detection)
-4. Add `prompt-test history` command to show improvement lineage
-
-## System Architecture Status
-
-### ✅ Working Components:
-- **Core testing infrastructure** - PromptTester, test case loading, Claude API integration
-- **Claude-based evaluation** - ClaudeReviewer with structured feedback
-- **Prompt improvement advisor** - PromptAdvisor with JSON-structured suggestions
-- **Web review interface** - Flask app with markdown rendering and test case enrichment
-- **Human review collection** - JSONL storage via ReviewManager
-- **File utilities** - YAML handling, result storage, enrichment logic
-
-### 🔧 Integration Gaps:
-- **Human reviews → Automated improvements** - Reviews collected but not fed into PromptAdvisor
-- **File selection logic** - Manual results file selection required
-- **Version tracking** - No lineage or performance history
-- **Cross-prompt contamination** - No filtering to avoid using wrong results files
-
-## Discovered Analysis Files
-
-Several comprehensive analysis files already exist with detailed improvement suggestions:
-
-1. `/results/analysis_current_20250524_135349_current.json` - Production prompt analysis (0.67 avg score)
-   - Focus on verification steps, pattern identification, missing topics coverage
-
-2. `/results/analysis_v1_baseline_comparison_current_vs_v1_baseline.json` - Baseline comparison
-   - Structured prompt improvements, prefill changes, user prompt restructuring
-
-These contain actionable suggestions that could be applied to create improved prompt versions.
-
-## Future Enhancements
-
-### Performance & Scalability
-1. Parallel evaluation for cost-effective Claude review
-2. Add progress indicators for long-running operations
-3. Cost tracking and budgeting features
-
-### Integration & Automation
-1. CI/CD pipeline integration for prompt regression testing
-2. **✅ COMPLETED: Automated deployment with validation and rollback**
-   - Added `uv run prompt-test publish --prompt <version>` command
-   - Automatic metadata cleanup (removes experiment_metadata, cleans names/descriptions)
-   - Built-in validation that prompt loads correctly in main service
-   - Message generation testing to ensure compatibility
-   - Automatic backup of existing production prompt
-   - Integration test execution with clear pass/fail reporting
-   - Safety features: temp file handling, error rollback, clear next steps
-3. Performance monitoring and regression detection
-
-## Technical Debt
-
-### Code Quality
-1. Add tests for `prompt_advisor.py` (requires mocking Claude API)
-2. Add tests for web review interface components
-3. Improve error handling in file selection logic
-4. Add type hints for human review integration
-
-## Next Session Recommendations
-
-**For immediate continuation**:
-1. **Human review integration** - Highest impact, completes feedback loop
-2. **File selection intelligence** - Prevents workflow errors, improves usability
-3. **Apply existing analysis suggestions** - Use discovered analysis files to create improved prompt versions
-4. **Version tracking implementation** - Foundation for proper iterative development
-
-**Key Files for Human Review Integration**:
-- `prompt_testing/evaluation/prompt_advisor.py:25` - `analyze_results_and_suggest_improvements()`
-- `prompt_testing/evaluation/reviewer.py` - ReviewManager and HumanReview classes
-- `prompt_testing/cli.py` - improve command implementation
-
-The human review integration is the critical missing piece that would create a complete feedback loop from web interface → analysis → improved prompts.
diff --git a/prompt_testing/cli.py b/prompt_testing/cli.py
index 6aef26d..82a39ed 100644
--- a/prompt_testing/cli.py
+++ b/prompt_testing/cli.py
@@ -1,811 +1,278 @@
-#!/usr/bin/env python3
-"""
-Command-line interface for prompt testing framework.
+"""CLI for prompt testing.
+
+Simple commands:
+  prompt-test run                  Run all test cases, save results
+  prompt-test run --cases foo bar  Run specific cases
+  prompt-test run --review         Also run Opus correctness review
+  prompt-test review results.json  Review existing results with Opus
+  prompt-test compare A B          Compare two result files side by side
+  prompt-test list                 List available test cases
+  prompt-test enrich               Enrich test cases with real CE assembly
+  prompt-test compilers            List CE compilers
 """
 
 import asyncio
 import json
 import sys
+from collections import defaultdict
 from pathlib import Path
 
 import click
 from dotenv import load_dotenv
 
-from app.explain_api import AssemblyItem, ExplainRequest
-from app.explanation_types import AudienceLevel, ExplanationType
-from app.prompt import Prompt
 from prompt_testing.ce_api import CompilerExplorerClient
 from prompt_testing.enricher import TestCaseEnricher
-from prompt_testing.evaluation.prompt_advisor import PromptOptimizer
-from prompt_testing.evaluation.reviewer import ReviewManager, create_simple_review_cli
-from prompt_testing.evaluation.scorer import load_all_test_cases
-from prompt_testing.file_utils import (
-    find_latest_results_file,
-    save_json_results,
-)
+from prompt_testing.file_utils import load_all_test_cases
 from prompt_testing.runner import PromptTester
-from prompt_testing.web_review import start_review_server
 
-# Load environment variables from .env file
 load_dotenv()
 
 
-@click.group(
-    help="Prompt testing framework for Claude explain service",
-    epilog="""
-Examples:
-
-  # Run basic optimization tests with current prompt
-  uv run prompt-test run --prompt current --categories basic_optimizations
-
-  # Compare two prompt versions
-  uv run prompt-test run --prompt v1_baseline --compare current
-
-  # Run specific test cases
-  uv run prompt-test run --prompt current --cases basic_loop_001
-
-  # Use a different Claude model for evaluation
-  uv run prompt-test run --prompt current --reviewer-model claude-3-5-sonnet-20241022
-
-  # List available test cases and prompts
-  uv run prompt-test list
-
-  # Review results interactively (web interface)
-  uv run prompt-test review --interactive
-
-  # Review specific results file (CLI)
-  uv run prompt-test review --results-file results/20241201_120000_current.json
-
-  # Analyze all results
-  uv run prompt-test analyze
-
-  # Get improvement suggestions based on test results
-  uv run prompt-test improve --prompt current
-
-  # Create an experimental improved version
-  uv run prompt-test improve --prompt current --create-improved --output current_v2
-
-  # Publish a tested prompt to production
-  uv run prompt-test publish --prompt v7 --name "Production v8"
-
-  # List available C++ compilers
-  uv run prompt-test compilers --language c++
-
-  # Search for GCC compilers and generate mapping file
-  uv run prompt-test compilers --search gcc --generate-map compiler_map.json
-
-  # Enrich test cases with real assembly from CE
-  uv run prompt-test enrich --input test_cases/new_tests.yaml --compiler-map compiler_map.json
-""",
-)
-@click.option(
-    "--project-root",
-    default=str(Path.cwd()),
-    help="Project root directory (default: current directory)",
-    type=click.Path(exists=True, file_okay=False, dir_okay=True),
-)
+@click.group(help="Prompt testing for Claude explain service")
+@click.option("--project-root", default=str(Path.cwd()), type=click.Path(exists=True))
 @click.pass_context
 def cli(ctx, project_root):
-    """Main CLI group."""
     ctx.ensure_object(dict)
     ctx.obj["project_root"] = Path(project_root)
 
 
 @cli.command()
-@click.option("--prompt", required=True, help="Prompt version to test")
-@click.option("--cases", multiple=True, help="Specific test case IDs to run")
-@click.option("--categories", multiple=True, help="Test case categories to run")
-@click.option("--compare", help="Compare with another prompt version")
-@click.option("--output", help="Output file name (auto-generated if not specified)")
-@click.option(
-    "--audience",
-    type=click.Choice([level.value for level in AudienceLevel]),
-    help="Filter test cases by target audience",
-)
-@click.option(
-    "--explanation-type",
-    type=click.Choice([exp_type.value for exp_type in ExplanationType]),
-    help="Filter test cases by explanation type",
-)
-@click.option(
-    "--reviewer-model",
-    default="claude-sonnet-4-0",
-    help="Claude model to use for reviewing (e.g., claude-sonnet-4-0, claude-3-5-sonnet-20241022)",
-)
-@click.option(
-    "--max-concurrent",
-    type=int,
-    default=5,
-    help="Maximum concurrent API requests (default: 5)",
-)
+@click.option("--prompt", default="current", help="Prompt version to test (default: current)")
+@click.option("--cases", multiple=True, help="Specific test case IDs")
+@click.option("--categories", multiple=True, help="Filter by category")
+@click.option("--output", help="Output filename")
+@click.option("--max-concurrent", type=int, default=5)
+@click.option("--review", is_flag=True, help="Also run Opus correctness review on results")
+@click.option("--review-model", default="claude-opus-4-6", help="Model for correctness review")
 @click.pass_context
-def run(ctx, prompt, cases, categories, compare, output, audience, explanation_type, reviewer_model, max_concurrent):
-    """Run test suite."""
-    project_root = ctx.obj["project_root"]
-    tester = PromptTester(
-        project_root,
-        reviewer_model=reviewer_model,
-        max_concurrent_requests=max_concurrent,
+def run(ctx, prompt, cases, categories, output, max_concurrent, review, review_model):
+    """Run test cases and save results for review."""
+    tester = PromptTester(ctx.obj["project_root"], max_concurrent=max_concurrent)
+    results = tester.run(
+        prompt_version=prompt,
+        case_ids=list(cases) if cases else None,
+        categories=list(categories) if categories else None,
     )
 
-    if compare:
-        results = tester.compare_prompt_versions(prompt, compare, list(cases) if cases else None)
-        output_file = output or f"comparison_{prompt}_vs_{compare}.json"
-    else:
-        results = tester.run_test_suite(
-            prompt,
-            list(cases) if cases else None,
-            list(categories) if categories else None,
-            audience=audience,
-            explanation_type=explanation_type,
-        )
-        output_file = output
-
-    output_path = tester.save_results(results, output_file)
-
-    # Print summary
-    if "summary" in results:
-        summary = results["summary"]
-        click.echo(f"\nSummary for {prompt}:")
-        click.echo(f"  Success rate: {summary['success_rate']:.1%}")
-        click.echo(f"  Cases: {summary['successful_cases']}/{summary['total_cases']}")
-
-        if "average_metrics" in summary:
-            avg = summary["average_metrics"]
-            click.echo(f"  Average score: {avg['overall_score']:.2f}")
-            click.echo(f"  Accuracy: {avg['accuracy']:.2f}")
-            click.echo(f"  Relevance: {avg['relevance']:.2f}")
-            click.echo(f"  Conciseness: {avg['conciseness']:.2f}")
-            click.echo(f"  Insight: {avg['insight']:.2f}")
-            click.echo(f"  Appropriateness: {avg['appropriateness']:.2f}")
-            click.echo(f"  Average tokens: {avg['average_tokens']:.0f}")
-            click.echo(f"  Average response time: {avg['average_response_time']:.0f}ms")
-
-    if compare and "case_comparisons" in results:
-        comparisons = results["case_comparisons"]
-        better_v1 = sum(1 for c in comparisons if c.get("better_version") == prompt)
-        better_v2 = sum(1 for c in comparisons if c.get("better_version") == compare)
-        click.echo(f"\nComparison {prompt} vs {compare}:")
-        click.echo(f"  {prompt} better: {better_v1} cases")
-        click.echo(f"  {compare} better: {better_v2} cases")
-
-    click.echo(f"\nDetailed results saved to: {output_path}")
+    if review:
+        results = asyncio.run(_run_reviews(ctx.obj["project_root"], results, review_model))
 
+    tester.save(results, output)
 
-@cli.command()
-@click.pass_context
-def list(ctx):
-    """List available test cases and prompts."""
-    project_root = ctx.obj["project_root"]
-
-    # List test cases
-    click.echo("Available test cases:")
-    test_cases = load_all_test_cases(str(project_root / "prompt_testing" / "test_cases"))
-
-    by_category = {}
-    for case in test_cases:
-        category = case.get("category", "unknown")
-        if category not in by_category:
-            by_category[category] = []
-        by_category[category].append(case)
-
-    for category, cases in sorted(by_category.items()):
-        click.echo(f"\n  {category}:")
-        for case in cases:
-            quality = case.get("quality", "unknown")
-            difficulty = case.get("difficulty", "unknown")
-            click.echo(f"    {case['id']} - {case.get('description', 'No description')} ({quality}, {difficulty})")
-
-    # List prompts
-    click.echo("\nAvailable prompts:")
-    prompts_dir = project_root / "prompt_testing" / "prompts"
-    if prompts_dir.exists():
-        for prompt_file in sorted(prompts_dir.glob("*.yaml")):
-            prompt_name = prompt_file.stem
-            click.echo(f"  {prompt_name}")
-    else:
-        click.echo("  No prompts directory found")
+    # Summary
+    click.echo(
+        f"\n{results['successful']}/{results['total_cases']} succeeded, total cost: ${results['total_cost_usd']:.4f}"
+    )
+    if review:
+        _print_review_summary(results)
 
 
 @cli.command()
-@click.option("--results-file", help="Results file to review (CLI mode)")
-@click.option("--interactive", "-i", is_flag=True, help="Start web interface for interactive review")
-@click.option("--port", type=int, default=5000, help="Port for web interface (default: 5000)")
-@click.option("--no-browser", is_flag=True, help="Don't automatically open browser")
+@click.argument("file_a")
+@click.argument("file_b")
+@click.option("--case", help="Show only this case ID")
 @click.pass_context
-def review(ctx, results_file, interactive, port, no_browser):
-    """Human review interface."""
-    project_root = ctx.obj["project_root"]
-
-    if interactive:
-        # Start web interface
-        try:
-            start_review_server(project_root, port=port, open_browser=not no_browser)
-        except KeyboardInterrupt:
-            click.echo("\n🛑 Review server stopped")
-        except Exception as e:
-            click.echo(f"❌ Error starting review server: {e}")
-            ctx.exit(1)
-        return
+def compare(ctx, file_a, file_b, case):
+    """Compare two result files side by side."""
+    results_dir = ctx.obj["project_root"] / "prompt_testing" / "results"
 
-    if results_file:
-        # Review specific results file via CLI
-        results_path = Path(results_file)
-        with results_path.open() as f:
-            results = json.load(f)
-
-        if "results" in results:
-            # Single test run
-            for result in results["results"]:
-                if not result["success"]:
-                    continue
-
-                review = create_simple_review_cli(result["case_id"], result["response"], result["prompt_version"])
-
-                # Save review
-                manager = ReviewManager(str(Path(project_root) / "prompt_testing" / "results"))
-                manager.save_review(review)
-
-                click.echo("Review saved!")
-
-                if click.confirm("\nContinue to next case?", default=False):
-                    continue
-                break
-        else:
-            click.echo("Invalid results file format")
-            ctx.exit(1)
-    else:
-        click.echo("Please specify either --interactive for web interface or --results-file for CLI review")
-        ctx.exit(1)
+    def load(name):
+        path = results_dir / name if not Path(name).is_absolute() else Path(name)
+        return json.loads(path.read_text())
 
+    a = load(file_a)
+    b = load(file_b)
 
-@cli.command()
-@click.option("--prompt", required=True, help="Prompt version to validate")
-@click.pass_context
-def validate(ctx, prompt):
-    """Validate prompt structure and compatibility."""
-    project_root = ctx.obj["project_root"]
-
-    if prompt == "current":
-        prompt_file = project_root / "app" / "prompt.yaml"
-    else:
-        prompt_file = project_root / "prompt_testing" / "prompts" / f"{prompt}.yaml"
+    a_by_id = {r["case_id"]: r for r in a["results"] if r["success"]}
+    b_by_id = {r["case_id"]: r for r in b["results"] if r["success"]}
 
-    if not prompt_file.exists():
-        click.echo(f"Error: Prompt file not found: {prompt_file}")
-        ctx.exit(1)
-
-    click.echo(f"Validating prompt: {prompt_file}")
-
-    try:
-        # Try to load the prompt using the production Prompt class
-        prompt_obj = Prompt(prompt_file)
-        click.echo("✓ Prompt structure is valid")
-
-        # Check model configuration
-        click.echo(f"✓ Model: {prompt_obj.model}")
-        click.echo(f"✓ Max tokens: {prompt_obj.max_tokens}")
-        click.echo(f"✓ Temperature: {prompt_obj.temperature}")
-
-        # Check audience levels
-        if prompt_obj.audience_levels:
-            click.echo(f"✓ Audience levels defined: {', '.join(prompt_obj.audience_levels.keys())}")
-        else:
-            click.echo("⚠ Warning: No audience levels defined")
-
-        # Check explanation types
-        if prompt_obj.explanation_types:
-            click.echo(f"✓ Explanation types defined: {', '.join(prompt_obj.explanation_types.keys())}")
-        else:
-            click.echo("⚠ Warning: No explanation types defined")
-
-        # Test generating messages
-        try:
-            test_request = ExplainRequest(
-                language="C++",
-                compiler="gcc 12.1",
-                compilationOptions=["-O2"],
-                instructionSet="x86_64",
-                code="int main() { return 0; }",
-                source="int main() { return 0; }",
-                asm=[AssemblyItem(text="ret", source={"line": 1})],
-                audience="beginner",
-                explanation_type="assembly",
-            )
-
-            messages = prompt_obj.generate_messages(test_request)
-            click.echo(f"✓ Successfully generated {len(messages)} messages")
-            click.echo("✓ Prompt is ready for production use")
-
-        except Exception as e:
-            click.echo(f"✗ Error generating messages: {e}")
-            ctx.exit(1)
-
-    except Exception as e:
-        click.echo(f"✗ Error loading prompt: {e}")
-        click.echo("\nMake sure the prompt has all required fields:")
-        click.echo("- model (with name, max_tokens, temperature)")
-        click.echo("- audience_levels")
-        click.echo("- explanation_types")
-        click.echo("- system_prompt")
-        click.echo("- user_prompt")
-        click.echo("- assistant_prefill")
-        ctx.exit(1)
+    common = sorted(set(a_by_id) & set(b_by_id))
+    if case:
+        common = [c for c in common if c == case]
 
+    if not common:
+        click.echo("No common successful cases to compare.")
+        return
 
-@cli.command()
+    click.echo(f"Comparing: {a.get('prompt_version', file_a)} vs {b.get('prompt_version', file_b)}")
+    click.echo(f"Model A: {a.get('model', '?')}  Model B: {b.get('model', '?')}")
+    click.echo()
+
+    total_a_cost = 0
+    total_b_cost = 0
+
+    for cid in common:
+        ra = a_by_id[cid]
+        rb = b_by_id[cid]
+        cost_a = ra["input_tokens"] * 3 / 1e6 + ra["output_tokens"] * 15 / 1e6
+        cost_b = rb["input_tokens"] * 3 / 1e6 + rb["output_tokens"] * 15 / 1e6
+        total_a_cost += cost_a
+        total_b_cost += cost_b
+
+        click.echo(f"{'=' * 72}")
+        click.echo(f"Case: {cid}")
+        click.echo(
+            f"  A: {ra['input_tokens']} in, {ra['output_tokens']} out, ${cost_a:.4f}, {ra.get('elapsed_ms', '?')}ms"
+        )
+        click.echo(
+            f"  B: {rb['input_tokens']} in, {rb['output_tokens']} out, ${cost_b:.4f}, {rb.get('elapsed_ms', '?')}ms"
+        )
+        click.echo()
+        click.echo(f"--- A ({a.get('prompt_version', file_a)}) ---")
+        click.echo(ra["explanation"][:2000])
+        if len(ra["explanation"]) > 2000:
+            click.echo(f"... ({len(ra['explanation'])} chars total)")
+        click.echo()
+        click.echo(f"--- B ({b.get('prompt_version', file_b)}) ---")
+        click.echo(rb["explanation"][:2000])
+        if len(rb["explanation"]) > 2000:
+            click.echo(f"... ({len(rb['explanation'])} chars total)")
+        click.echo()
+
+    click.echo(f"{'=' * 72}")
+    click.echo(f"Total cost — A: ${total_a_cost:.4f}  B: ${total_b_cost:.4f}")
+
+
+@cli.command("list")
 @click.pass_context
-def analyze(ctx):
-    """Analyze results and generate reports."""
-    project_root = ctx.obj["project_root"]
-    results_dir = project_root / "prompt_testing" / "results"
-
-    if not results_dir.exists():
-        click.echo("No results directory found")
-        ctx.exit(1)
+def list_cases(ctx):
+    """List available test cases."""
+    test_dir = ctx.obj["project_root"] / "prompt_testing" / "test_cases"
+    cases = load_all_test_cases(str(test_dir))
 
-    # Find all result files
-    result_files = list(results_dir.glob("*.json"))
-    if not result_files:
-        click.echo("No result files found")
-        ctx.exit(1)
+    by_cat = defaultdict(list)
+    for c in cases:
+        by_cat[c.get("category", "unknown")].append(c)
 
-    click.echo(f"Found {len(result_files)} result files:")
-
-    all_summaries = []
-    for result_file in sorted(result_files):
-        try:
-            with result_file.open() as f:
-                data = json.load(f)
-
-            if "summary" in data:
-                summary = data["summary"]
-                summary["file"] = result_file.name
-                all_summaries.append(summary)
-
-                click.echo(f"\n{result_file.name}:")
-                click.echo(f"  Prompt: {summary['prompt_version']}")
-                click.echo(f"  Success rate: {summary['success_rate']:.1%}")
-                click.echo(f"  Cases: {summary['successful_cases']}/{summary['total_cases']}")
-
-                if "average_metrics" in summary:
-                    avg = summary["average_metrics"]
-                    click.echo(f"  Avg score: {avg['overall_score']:.2f}")
-                    click.echo(f"  Accuracy: {avg['accuracy']:.2f}")
-                    click.echo(f"  Relevance: {avg['relevance']:.2f}")
-                    click.echo(f"  Conciseness: {avg['conciseness']:.2f}")
-                    click.echo(f"  Insight: {avg['insight']:.2f}")
-                    click.echo(f"  Appropriateness: {avg['appropriateness']:.2f}")
-
-        except Exception as e:
-            click.echo(f"  Error reading {result_file.name}: {e}")
-
-    # Create summary report
-    if all_summaries:
-        summary_file = results_dir / "analysis_summary.json"
-        with summary_file.open("w") as f:
-            best_prompt = None
-            if all_summaries:
-                best_summary = max(all_summaries, key=lambda x: x.get("average_metrics", {}).get("overall_score", 0))
-                best_prompt = best_summary["prompt_version"]
-
-            json.dump(
-                {"total_files": len(all_summaries), "summaries": all_summaries, "best_prompt": best_prompt}, f, indent=2
-            )
-
-        click.echo(f"\nAnalysis summary saved to: {summary_file}")
+    for cat, items in sorted(by_cat.items()):
+        click.echo(f"\n{cat}:")
+        for c in items:
+            audience = c.get("audience", "beginner")
+            click.echo(f"  {c['id']:40} {c.get('description', '')[:50]}  [{audience}]")
 
 
 @cli.command()
-@click.option("--prompt", required=True, help="Prompt version to publish")
-@click.option("--name", help="Production name for the prompt (auto-generated if not specified)")
-@click.option("--description", help="Production description (auto-cleaned if not specified)")
+@click.option("--input", "-i", "input_file", required=True, help="Input YAML file")
+@click.option("--output", "-o", help="Output file")
+@click.option("--compiler-map", "-m", help="Compiler name → CE ID mapping JSON")
+@click.option("--max-concurrent", type=int, default=3)
 @click.pass_context
-def publish(ctx, prompt, name, description):
-    """Publish a tested prompt to production (app/prompt.yaml)."""
-    import shutil
-    import subprocess
-    import tempfile
-
-    from ruamel.yaml import YAML
-
-    project_root = ctx.obj["project_root"]
-    prompt_file = project_root / "prompt_testing" / "prompts" / f"{prompt}.yaml"
-    production_file = project_root / "app" / "prompt.yaml"
-
-    # Check if prompt file exists
-    if not prompt_file.exists():
-        click.echo(f"✗ Prompt file not found: {prompt_file}")
-        ctx.exit(1)
-
-    click.echo(f"📋 Publishing prompt '{prompt}' to production...")
-
-    try:
-        # Load and clean up the prompt
-        yaml = YAML()
-        yaml.preserve_quotes = True
-        yaml.default_flow_style = False
-
-        with prompt_file.open(encoding="utf-8") as f:
-            prompt_data = yaml.load(f)
-
-        # Clean up metadata for production
-        original_name = prompt_data.get("name", prompt)
-        original_description = prompt_data.get("description", "")
-
-        # Remove experimental metadata
-        if "experiment_metadata" in prompt_data:
-            click.echo("🧹 Removing experiment_metadata for production")
-            del prompt_data["experiment_metadata"]
-
-        # Update name and description for production
-        if name:
-            prompt_data["name"] = name
-            click.echo(f"📝 Updated name: '{original_name}' → '{name}'")
-        else:
-            # Auto-generate production name
-            prod_name = f"Production {original_name.replace('Version', 'v').strip()}"
-            prompt_data["name"] = prod_name
-            click.echo(f"📝 Auto-generated name: '{original_name}' → '{prod_name}'")
-
-        if description:
-            prompt_data["description"] = description
-            click.echo("📝 Updated description")
-        else:
-            # Clean up description to remove experimental language
-            clean_desc = original_description.replace("Human feedback integration", "Production prompt")
-            clean_desc = clean_desc.replace("improved markdown formatting, conciseness", "optimized for clarity")
-            clean_desc = clean_desc.replace("based on", "incorporating")
-            prompt_data["description"] = clean_desc
-            if clean_desc != original_description:
-                click.echo("📝 Cleaned up description for production")
-
-        # Write to temporary file first
-        with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False, encoding="utf-8") as temp_file:
-            yaml.dump(prompt_data, temp_file)
-            temp_path = Path(temp_file.name)
-
-        click.echo("✅ Cleaned up prompt metadata")
-
-        # Validate the prompt loads correctly in main service
-        click.echo("🔍 Validating prompt structure...")
-        try:
-            from app.prompt import Prompt
-
-            prompt_obj = Prompt(temp_path)
-            click.echo("✅ Prompt structure validation passed")
-        except Exception as e:
-            click.echo(f"✗ Prompt validation failed: {e}")
-            temp_path.unlink()  # Clean up temp file
-            ctx.exit(1)
-
-        # Test that we can generate messages
-        click.echo("🧪 Testing message generation...")
-        try:
-            test_request = ExplainRequest(
-                language="C++",
-                compiler="gcc 12.1",
-                compilationOptions=["-O2"],
-                instructionSet="x86_64",
-                code="int main() { return 0; }",
-                source="int main() { return 0; }",
-                asm=[AssemblyItem(text="ret", source={"line": 1})],
-                audience="beginner",
-                explanation_type="assembly",
-            )
-            messages = prompt_obj.generate_messages(test_request)
-            click.echo(f"✅ Successfully generated {len(messages)} messages")
-        except Exception as e:
-            click.echo(f"✗ Message generation test failed: {e}")
-            temp_path.unlink()  # Clean up temp file
-            ctx.exit(1)
-
-        # Backup existing production prompt
-        if production_file.exists():
-            backup_file = production_file.with_suffix(".yaml.backup")
-            shutil.copy2(production_file, backup_file)
-            click.echo(f"💾 Backed up existing prompt to {backup_file.name}")
-
-        # Copy to production
-        shutil.copy2(temp_path, production_file)
-        temp_path.unlink()  # Clean up temp file
-        click.echo(f"🚀 Copied prompt to {production_file}")
-
-        # Run integration tests
-        click.echo("🧪 Running integration tests...")
-        try:
-            result = subprocess.run(
-                ["uv", "run", "pytest", "app/test_explain.py", "-v"],
-                cwd=project_root,
-                capture_output=True,
-                text=True,
-                timeout=60,
-            )
-            if result.returncode == 0:
-                click.echo("✅ Integration tests passed")
-            else:
-                click.echo("⚠️  Integration tests had issues:")
-                click.echo(result.stdout)
-                if result.stderr:
-                    click.echo("STDERR:")
-                    click.echo(result.stderr)
-                click.echo("❗ Consider reviewing the test failures")
-        except subprocess.TimeoutExpired:
-            click.echo("⚠️  Integration tests timed out")
-        except Exception as e:
-            click.echo(f"⚠️  Could not run integration tests: {e}")
-
-        click.echo(f"\n🎉 Successfully published '{prompt}' to production!")
-        click.echo(f"📁 Location: {production_file}")
-        click.echo("\n📋 Next steps:")
-        click.echo("  1. Test the service locally: uv run fastapi dev")
-        click.echo("  2. Run manual tests: ./test-explain.sh")
-        click.echo("  3. Commit the changes: git add app/prompt.yaml && git commit")
-
-    except Exception as e:
-        click.echo(f"✗ Publication failed: {e}")
+def enrich(ctx, input_file, output, compiler_map, max_concurrent):
+    """Enrich test cases with real assembly from CE API."""
+    input_path = Path(input_file)
+    if not input_path.exists():
+        click.echo(f"Not found: {input_path}")
         ctx.exit(1)
 
+    compiler_map_data = None
+    if compiler_map:
+        compiler_map_data = json.loads(Path(compiler_map).read_text())
 
-@cli.command()
-@click.option("--prompt", required=True, help="Prompt version to improve")
-@click.option("--results-file", help="Specific results file to analyze (uses most recent if not specified)")
-@click.option("--show-suggestions", is_flag=True, default=True, help="Display suggestions in terminal")
-@click.option("--create-improved", is_flag=True, help="Create an experimental improved prompt version")
-@click.option("--output", help="Name for the improved prompt version")
-@click.pass_context
-def improve(ctx, prompt, results_file, show_suggestions, create_improved, output):
-    """Analyze results and suggest prompt improvements."""
-    project_root = ctx.obj["project_root"]
-
-    # Validate arguments
-    if create_improved and not output:
-        click.echo("Error: --output is required when using --create-improved")
-        ctx.exit(1)
-
-    optimizer = PromptOptimizer(project_root)
-
-    # Find the results file
-    if results_file:
-        results_file_path = results_file
-    else:
-        results_dir = project_root / "prompt_testing" / "results"
-        latest_file = find_latest_results_file(results_dir, prompt)
-        if not latest_file:
-            click.echo(f"No results found for prompt: {prompt}")
-            click.echo(f"Run 'prompt-test run --prompt {prompt}' first")
-            ctx.exit(1)
-        results_file_path = latest_file.name
-        click.echo(f"Using most recent results: {results_file_path}")
-
-    # Analyze and potentially create improved version with human feedback integration
-    output_path, human_stats = optimizer.analyze_and_improve_with_human_feedback(
-        results_file_path, prompt, output if create_improved else None
-    )
+    output_path = Path(output) if output else None
 
-    # Show human review status
-    if human_stats["total_reviews"] > 0:
-        coverage_pct = human_stats["coverage"] * 100
-        click.echo(f"✓ Incorporated {human_stats['total_reviews']} human reviews ({coverage_pct:.0f}% coverage)")
-    else:
-        click.echo("📝 No human reviews found, using automated analysis only")
-
-    # Display key suggestions
-    if show_suggestions and not create_improved:
-        with output_path.open() as f:
-            analysis = json.load(f)
-
-        click.echo("\n=== PROMPT IMPROVEMENT SUGGESTIONS ===")
-
-        if "analysis_summary" in analysis:
-            summary = analysis["analysis_summary"]
-            click.echo(f"\nAverage Score: {summary['average_score']:.2f}")
-            click.echo("\nMost Common Missing Topics:")
-            topics = summary.get("common_missing_topics", [])
-            if topics:
-                for topic in topics:
-                    click.echo(f"  - {topic}")
-            else:
-                click.echo("  None identified")
-
-        if "suggestions" in analysis and isinstance(analysis["suggestions"], dict):
-            suggestions = analysis["suggestions"]
-
-            if "priority_improvements" in suggestions:
-                click.echo("\n🎯 Priority Improvements:")
-                for imp in suggestions["priority_improvements"][:3]:
-                    click.echo(f"\n  Issue: {imp['issue']}")
-                    click.echo(f"  Current: '{imp.get('current_text', 'N/A')[:60]}...'")
-                    click.echo(f"  Suggested: '{imp.get('suggested_text', 'N/A')[:60]}...'")
-                    click.echo(f"  Rationale: {imp.get('rationale', 'N/A')}")
-
-            if "expected_impact" in suggestions:
-                click.echo(f"\n✨ Expected Impact: {suggestions['expected_impact']}")
+    with TestCaseEnricher() as enricher:
+        asyncio.run(
+            enricher.enrich_file_async(input_path, output_path, compiler_map_data, max_concurrent=max_concurrent)
+        )
 
 
 @cli.command()
-@click.option("--input", "-i", required=True, help="Input test case YAML file")
-@click.option("--output", "-o", help="Output file (defaults to input.enriched.yaml)")
-@click.option("--compiler-map", "-m", help="JSON file mapping test compiler names to CE compiler IDs")
-@click.option(
-    "--delay",
-    type=float,
-    default=0.5,
-    help="Delay between API calls in seconds (default: 0.5, ignored when using parallel mode)",
-)
-@click.option("--max-concurrent", type=int, default=3, help="Maximum concurrent CE API requests (default: 3)")
+@click.option("--language", "-l", help="Filter by language")
+@click.option("--search", "-s", help="Search by name")
+@click.option("--limit", type=int, default=50)
 @click.pass_context
-def enrich(ctx, input, output, compiler_map, delay, max_concurrent):  # noqa: ARG001
-    """Enrich test cases with CE API data."""
-    input_file = Path(input)
-    if not input_file.exists():
-        click.echo(f"Input file not found: {input_file}")
-        ctx.exit(1)
-
-    # Load compiler map if provided
-    compiler_map_data = None
-    if compiler_map:
-        map_file = Path(compiler_map)
-        if map_file.exists():
-            with map_file.open(encoding="utf-8") as f:
-                compiler_map_data = json.load(f)
-            click.echo(f"Loaded compiler map with {len(compiler_map_data)} entries")
-
-    # Create output path
-    output_file = None
-    if output:
-        output_file = Path(output)
-
-    # Enrich test cases
-    with TestCaseEnricher() as enricher:
-        try:
-            # Use async version with max_concurrent parameter
-            asyncio.run(
-                enricher.enrich_file_async(
-                    input_file,
-                    output_file,
-                    compiler_map_data,
-                    max_concurrent=max_concurrent,
-                )
-            )
-        except Exception as e:
-            click.echo(f"Enrichment failed: {e}")
-            ctx.exit(1)
-
-
-def _generate_compiler_mapping(compilers: list, output_file: Path) -> None:
-    """Generate a compiler name to ID mapping file.
-
-    Args:
-        compilers: List of compiler objects
-        output_file: Path to save the mapping
-    """
-    mapping = {}
-    for compiler in compilers:
-        # Use the full name as key
-        mapping[compiler.name] = compiler.id
-        # Also add common short versions
-        if "gcc" in compiler.name.lower():
-            # Extract version like "gcc 13.1" from "x86-64 gcc 13.1"
-            parts = compiler.name.split()
-            for i, part in enumerate(parts):
-                if part.lower() == "gcc" and i + 1 < len(parts):
-                    short_name = f"gcc {parts[i + 1]}"
-                    if short_name not in mapping:
-                        mapping[short_name] = compiler.id
-                    break
-
-    save_json_results(mapping, output_file)
-    click.echo(f"Generated compiler mapping file: {output_file}")
-
-
-def _filter_compilers(compilers: list, instruction_set, search) -> list:
-    """Filter compilers based on command arguments.
-
-    Args:
-        compilers: List of compiler objects
-        instruction_set: Instruction set filter
-        search: Search string
-
-    Returns:
-        Filtered list of compilers
-    """
-    # Filter by instruction set if requested
-    if instruction_set:
-        compilers = [c for c in compilers if c.instruction_set == instruction_set]
-        click.echo(f"Filtered to {len(compilers)} compilers with instruction set '{instruction_set}'\n")
-
-    # Search if requested
-    if search:
-        search_lower = search.lower()
-        compilers = [c for c in compilers if search_lower in c.name.lower() or search_lower in c.id.lower()]
-        click.echo(f"Found {len(compilers)} compilers matching '{search}'\n")
-
-    return compilers
+def compilers(ctx, language, search, limit):  # noqa: ARG001
+    """List available compilers from CE API."""
+    with CompilerExplorerClient() as client:
+        results = client.get_compilers(language)
+        if search:
+            sl = search.lower()
+            results = [c for c in results if sl in c.name.lower() or sl in c.id.lower()]
+        for c in sorted(results, key=lambda c: c.name)[:limit]:
+            click.echo(f"{c.id:25} {c.name}")
+        if len(results) > limit:
+            click.echo(f"... and {len(results) - limit} more")
+
+
+async def _run_reviews(project_root: Path, results: dict, model: str) -> dict:
+    """Run correctness reviews on all successful results."""
+    from prompt_testing.reviewer import CorrectnessReviewer
+
+    reviewer = CorrectnessReviewer(model=model)
+    test_dir = project_root / "prompt_testing" / "test_cases"
+    all_cases = load_all_test_cases(str(test_dir))
+    cases_by_id = {c["id"]: c for c in all_cases}
+
+    successful = [r for r in results["results"] if r["success"]]
+    click.echo(f"\nReviewing {len(successful)} results with {model}...")
+
+    review_cost = 0.0
+    errors_found = 0
+
+    for i, result in enumerate(successful, 1):
+        case = cases_by_id.get(result["case_id"])
+        if not case:
+            continue
+
+        review = await reviewer.review_test_result(case, result["explanation"])
+        result["review"] = review
+
+        status = "✓" if review.get("correct") else "✗"
+        n_issues = len(review.get("issues", []))
+        if not review.get("correct"):
+            errors_found += 1
+        # Opus pricing: $15/M in, $75/M out
+        cost = review.get("reviewer_input_tokens", 0) * 15 / 1e6 + review.get("reviewer_output_tokens", 0) * 75 / 1e6
+        review_cost += cost
+        click.echo(f"  [{i}/{len(successful)}] {status} {result['case_id']} ({n_issues} issues, ${cost:.4f})")
+
+    results["review_model"] = model
+    results["review_cost_usd"] = round(review_cost, 6)
+    results["total_cost_usd"] = round(results["total_cost_usd"] + review_cost, 6)
+    results["errors_found"] = errors_found
+    return results
+
+
+def _print_review_summary(results: dict) -> None:
+    """Print a summary of correctness reviews."""
+    reviewed = [r for r in results["results"] if r.get("review")]
+    correct = sum(1 for r in reviewed if r["review"].get("correct"))
+    incorrect = len(reviewed) - correct
+
+    click.echo(f"\nCorrectness: {correct}/{len(reviewed)} passed")
+    if incorrect:
+        click.echo(f"\n⚠ {incorrect} case(s) with issues:")
+        for r in reviewed:
+            review = r["review"]
+            if not review.get("correct"):
+                click.echo(f"\n  {r['case_id']}:")
+                for issue in review.get("issues", []):
+                    sev = "🔴" if issue["severity"] == "error" else "🟡"
+                    click.echo(f"    {sev} {issue['claim']}")
+                    click.echo(f"       → {issue['correction']}")
+
+    click.echo(f"\nReview cost: ${results.get('review_cost_usd', 0):.4f} ({results.get('review_model', '?')})")
 
 
 @cli.command()
-@click.option("--language", "-l", help="Filter by language (e.g., c++, c, rust)")
-@click.option("--search", "-s", help="Search for compiler by name")
-@click.option("--instruction-set", "-i", help="Filter by instruction set")
-@click.option("--group", "-g", is_flag=True, help="Group by compiler type")
-@click.option("--json", "json_output", is_flag=True, help="Output in JSON format")
-@click.option("--verbose", "-v", is_flag=True, help="Show detailed compiler info")
-@click.option("--limit", type=int, default=50, help="Maximum compilers to show (default: 50)")
-@click.option("--generate-map", help="Generate a compiler name to ID mapping file")
+@click.argument("results_file")
+@click.option("--model", default="claude-opus-4-6", help="Reviewer model")
 @click.pass_context
-def compilers(ctx, language, search, instruction_set, group, json_output, verbose, limit, generate_map):  # noqa: ARG001
-    """List available compilers from CE API."""
-    from collections import defaultdict
+def review(ctx, results_file, model):
+    """Run Opus correctness review on existing results."""
+    results_dir = ctx.obj["project_root"] / "prompt_testing" / "results"
+    path = results_dir / results_file if not Path(results_file).is_absolute() else Path(results_file)
 
-    with CompilerExplorerClient() as client:
-        # Get compilers
-        click.echo(f"Fetching compilers{f' for {language}' if language else ''}...")
-        compilers = client.get_compilers(language)
-        click.echo(f"Found {len(compilers)} compilers\n")
-
-        # Apply filters
-        compilers = _filter_compilers(compilers, instruction_set, search)
-
-        if not compilers:
-            click.echo("No compilers found matching criteria")
-            return
-
-        # Generate mapping file if requested
-        if generate_map:
-            _generate_compiler_mapping(compilers, Path(generate_map))
-            return
-
-        # Display compilers
-        if json_output:
-            # JSON output for scripting
-            output = []
-            for compiler in compilers:
-                output.append(
-                    {
-                        "id": compiler.id,
-                        "name": compiler.name,
-                        "version": compiler.version,
-                        "lang": compiler.lang,
-                        "instruction_set": compiler.instruction_set,
-                        "compiler_type": compiler.compiler_type,
-                    }
-                )
-            click.echo(json.dumps(output, indent=2))
-        else:
-            # Human-readable output
-            if group:
-                # Group by compiler type
-                by_type = defaultdict(list)
-                for compiler in compilers:
-                    by_type[compiler.compiler_type or "unknown"].append(compiler)
-
-                for comp_type, comp_list in sorted(by_type.items()):
-                    click.echo(f"\n{comp_type.upper()} ({len(comp_list)} compilers):")
-                    for compiler in sorted(comp_list, key=lambda c: c.name)[:20]:
-                        click.echo(f"  {compiler.id:25} {compiler.name}")
-                    if len(comp_list) > 20:
-                        click.echo(f"  ... and {len(comp_list) - 20} more")
-            else:
-                # List all (limited)
-                for compiler in sorted(compilers, key=lambda c: c.name)[:limit]:
-                    click.echo(f"{compiler.id:25} {compiler.name}")
-                    if verbose:
-                        if compiler.version:
-                            click.echo(f"  {'':25} Version: {compiler.version}")
-                        if compiler.instruction_set:
-                            click.echo(f"  {'':25} Instruction set: {compiler.instruction_set}")
-                        if compiler.compiler_type:
-                            click.echo(f"  {'':25} Type: {compiler.compiler_type}")
-                        click.echo()
-
-                if len(compilers) > limit:
-                    click.echo(f"\n... and {len(compilers) - limit} more compilers")
-                    click.echo("Use --limit to show more")
+    results = json.loads(path.read_text())
+    results = asyncio.run(_run_reviews(ctx.obj["project_root"], results, model))
+
+    # Save updated results
+    path.write_text(json.dumps(results, indent=2))
+    click.echo(f"\nUpdated {path}")
+    _print_review_summary(results)
 
 
 def main():
-    """Main CLI entry point."""
     cli()
 
 
diff --git a/prompt_testing/evaluation/__init__.py b/prompt_testing/evaluation/__init__.py
deleted file mode 100644
index 9566c4a..0000000
--- a/prompt_testing/evaluation/__init__.py
+++ /dev/null
@@ -1 +0,0 @@
-# Evaluation framework for prompt testing
diff --git a/prompt_testing/evaluation/claude_reviewer.py b/prompt_testing/evaluation/claude_reviewer.py
deleted file mode 100644
index ce86caf..0000000
--- a/prompt_testing/evaluation/claude_reviewer.py
+++ /dev/null
@@ -1,389 +0,0 @@
-"""
-Claude-based AI reviewer for evaluating prompt responses.
-Uses advanced models with constitutional AI principles.
-"""
-
-import json
-from dataclasses import dataclass
-
-from anthropic import Anthropic, AsyncAnthropic
-
-from app.explanation_types import AudienceLevel, ExplanationType
-
-
-@dataclass
-class EvaluationMetrics:
-    """New metrics-based evaluation for a single response."""
-
-    # Core metrics (0-1 scale)
-    accuracy: float  # Technical correctness without false claims
-    relevance: float  # Discusses actual code, recognizes optimization level
-    conciseness: float  # Direct explanation without filler or boilerplate
-    insight: float  # Explains WHY and provides actionable understanding
-    appropriateness: float  # Matches audience level and explanation type
-
-    overall_score: float  # Weighted combination of above
-
-    # Additional metrics
-    token_count: int
-    response_time_ms: int | None = None
-
-    # Detailed feedback
-    flags: list[str] | None = None  # Issues found (BS patterns, etc)
-    strengths: list[str] | None = None  # What was done well
-    notes: str | None = None  # General feedback
-
-
-@dataclass
-class ReviewCriteria:
-    """New metrics-based criteria for Claude to evaluate responses."""
-
-    accuracy: str = """
-    Evaluate technical accuracy (0-100):
-    - Are assembly instructions correctly explained?
-    - No false claims about hardware behavior (e.g., "single-cycle multiplication")
-    - No invented optimizations or non-existent features
-    - Correct understanding of instruction semantics
-    - Heavily penalize confident incorrectness
-    """
-
-    relevance: str = """
-    Evaluate relevance to the actual code (0-100):
-    - Discusses THIS specific code, not hypothetical versions
-    - Recognizes actual optimization level from assembly patterns
-    - Acknowledges when code is clearly unoptimized
-    - No false claims of "efficiency" for obviously naive/unoptimized code
-    - No generic statements that don't match the actual assembly
-    """
-
-    conciseness: str = """
-    Evaluate conciseness and signal-to-noise ratio (0-100):
-    - Direct explanation of assembly vs generic filler
-    - No boilerplate headers ("Architecture:", "Optimization Level:", etc.)
-    - Focused, to-the-point explanations
-    - No padding with obvious restatements
-    - Avoids formulaic structure when not needed
-    """
-
-    insight: str = """
-    Evaluate practical insight and understanding (0-100):
-    - Explains WHY the compiler made these specific choices
-    - Provides actionable understanding for developers
-    - No useless or incorrect suggestions (e.g., "use __builtin_mul")
-    - Focuses on actual patterns present in THIS code
-    - Helps reader understand compiler behavior principles
-    """
-
-    appropriateness: str = """
-    Evaluate appropriateness for audience and explanation type (0-100):
-    - Matches audience level without condescension
-    - Matches explanation type focus (assembly/source/optimization)
-    - Beginners get foundations, experts get depth
-    - No over-explaining basics to experts
-    - No overwhelming beginners with trivia
-    - Content matches the requested explanation type
-    """
-
-
-_AUDIENCE_LEVEL = {
-    AudienceLevel.BEGINNER: """The explanation should be aimed at beginners.
-They will need basic concepts about assembly explained, and may need to know about
-calling conventions and other key information.""",
-    AudienceLevel.EXPERIENCED: """The explanation should target an experienced audience.
-They will not need explanation about trivial assembly idioms, calling conventions etc. They
-may need to be told about more esoteric instructions. Assume the audience knows most instructions
-and can handle technical terminology and advanced optimizations.""",
-}
-
-_EXPLANATION_TYPE = {
-    ExplanationType.ASSEMBLY: """The explanation should be predominantly about the compiled assembly.""",
-    ExplanationType.HAIKU: """The explanation should be in the form of a haiku,
-    capturing the essence of the code's behavior in a poetic way.""",
-}
-
-
-class ClaudeReviewer:
-    """Uses Claude to evaluate prompt responses with sophisticated analysis."""
-
-    def __init__(
-        self,
-        anthropic_api_key: str | None = None,
-        reviewer_model: str = "claude-sonnet-4-0",
-        enable_thinking: bool = True,
-    ):
-        self.client = Anthropic(api_key=anthropic_api_key) if anthropic_api_key else Anthropic()
-        self.async_client = AsyncAnthropic(api_key=anthropic_api_key) if anthropic_api_key else AsyncAnthropic()
-        self.reviewer_model = reviewer_model
-        self.enable_thinking = enable_thinking
-        self.criteria = ReviewCriteria()
-
-    def _build_evaluation_prompt(
-        self,
-        source_code: str,
-        assembly_code: str,
-        explanation: str,
-        test_case: dict,
-        audience: AudienceLevel,
-        explanation_type: ExplanationType,
-    ) -> str:
-        """Build the evaluation prompt for Claude."""
-
-        # Use the same criteria for all explanation types
-        criteria = {
-            "accuracy": self.criteria.accuracy,
-            "relevance": self.criteria.relevance,
-            "conciseness": self.criteria.conciseness,
-            "insight": self.criteria.insight,
-            "appropriateness": self.criteria.appropriateness,
-        }
-
-        prompt = f"""You are an expert in compiler technology and technical education.
-Your task is to evaluate an AI-generated explanation of Compiler Explorer's output using our metrics.
-
-## Context
-
-The user provided this source code:
-```
-{source_code}
-```
-
-Which compiled to this assembly:
-```
-{assembly_code}
-```
-
-## The AI's Explanation to Evaluate
-
-{explanation}
-
-## Evaluation Context
-
-We assume the user is aware of which compiler they've selected, and if they have provided
-command-line parameters, they are aware what those do. Similarly they know the architecture
-they've selected and so there is no need to repeat any of this information unless it is
-critical to a point that needs to be made later on.
-
-Target audience: {audience.value}
-{_AUDIENCE_LEVEL[audience]}
-
-Explanation type: {explanation_type.value}
-{_EXPLANATION_TYPE[explanation_type]}
-
-Test case description: {test_case.get("description", "No description provided")}
-
-## METRICS SYSTEM
-
-Evaluate the explanation on these 5 dimensions:
-
-1. **Accuracy (0-100)**
-{criteria["accuracy"]}
-
-2. **Relevance (0-100)**
-{criteria["relevance"]}
-
-3. **Conciseness (0-100)**
-{criteria["conciseness"]}
-
-4. **Insight (0-100)**
-{criteria["insight"]}
-
-5. **Appropriateness (0-100)**
-{criteria["appropriateness"]}
-
-"""
-
-        # Add test case specific context if available
-        if test_case.get("common_mistakes"):
-            prompt += f"""
-## Common Mistakes to Watch For
-This test case commonly produces these mistakes: {", ".join(test_case["common_mistakes"])}
-Check if the explanation falls into any of these traps.
-
-"""
-
-        prompt += f"""
-## Response Format
-
-{"First, think through your evaluation step by step." if self.enable_thinking else ""}
-
-Then provide your evaluation in this exact JSON format:
-```json
-{{
-    "scores": {{
-        "accuracy": <0-100>,
-        "relevance": <0-100>,
-        "conciseness": <0-100>,
-        "insight": <0-100>,
-        "appropriateness": <0-100>
-    }},
-    "flags": ["Unverified technical claims", "Claims efficiency on unoptimized code", ...],
-    "strengths": ["Correctly explains instruction behavior", "Good contextual relevance", ...],
-    "overall_assessment": "A 1-2 sentence overall assessment"
-}}
-```
-"""
-        return prompt
-
-    def evaluate_response(
-        self,
-        source_code: str,
-        assembly_code: str,
-        explanation: str,
-        test_case: dict,
-        audience: AudienceLevel,
-        explanation_type: ExplanationType,
-        token_count: int = 0,
-        response_time_ms: int | None = None,
-    ) -> EvaluationMetrics:
-        """Evaluate a response using Claude."""
-
-        evaluation_prompt = self._build_evaluation_prompt(
-            source_code, assembly_code, explanation, test_case, audience, explanation_type
-        )
-
-        # Call Claude for evaluation
-        message = self.client.messages.create(
-            model=self.reviewer_model,
-            max_tokens=2000,
-            temperature=0.2,  # Lower temperature for more consistent evaluation
-            system="You are a meticulous technical reviewer with expertise in compilers and education.",
-            messages=[{"role": "user", "content": evaluation_prompt}],
-        )
-
-        # Parse the JSON response
-        response_text = message.content[0].text
-
-        # Extract JSON from the response (handle thinking output if present)
-        json_start = response_text.find("{")
-        json_end = response_text.rfind("}") + 1
-
-        if json_start == -1 or json_end == 0:
-            raise ValueError(f"No valid JSON found in Claude's response: {response_text[:200]}...")
-
-        json_str = response_text[json_start:json_end]
-
-        try:
-            evaluation = json.loads(json_str)
-        except json.JSONDecodeError as e:
-            raise ValueError(
-                f"Failed to parse Claude's evaluation response as JSON: {e}\nResponse: {json_str[:200]}..."
-            ) from e
-
-        # Convert Claude's 0-100 scores to 0-1 range
-        if "scores" not in evaluation:
-            raise ValueError(f"Missing 'scores' in evaluation response: {list(evaluation.keys())}")
-
-        scores = evaluation["scores"]
-
-        # Validate required score fields
-        required_scores = ["accuracy", "relevance", "conciseness", "insight", "appropriateness"]
-        missing_scores = [field for field in required_scores if field not in scores]
-        if missing_scores:
-            raise ValueError(f"Missing required scores: {missing_scores}")
-
-        # Calculate overall score with new weights
-        weights = {"accuracy": 0.3, "relevance": 0.25, "conciseness": 0.2, "insight": 0.15, "appropriateness": 0.1}
-        overall_score = sum(scores[metric] * weights[metric] for metric in weights) / 100
-
-        # Map to EvaluationMetrics format
-        return EvaluationMetrics(
-            accuracy=scores["accuracy"] / 100,
-            relevance=scores["relevance"] / 100,
-            conciseness=scores["conciseness"] / 100,
-            insight=scores["insight"] / 100,
-            appropriateness=scores["appropriateness"] / 100,
-            overall_score=overall_score,
-            token_count=token_count,
-            response_time_ms=response_time_ms,
-            flags=evaluation.get("flags"),
-            strengths=evaluation.get("strengths"),
-            notes=evaluation.get("overall_assessment"),
-        )
-
-    def _calculate_length_score(self, explanation: str, difficulty: str) -> float:
-        """Simple length scoring (can be refined based on Claude's feedback)."""
-        length = len(explanation)
-        length_ranges = {"beginner": (150, 400), "intermediate": (200, 600), "advanced": (250, 800)}
-        min_len, max_len = length_ranges.get(difficulty, (200, 600))
-
-        if min_len <= length <= max_len:
-            return 1.0
-        if length < min_len:
-            return max(0.3, length / min_len)
-        return max(0.5, 1.0 - (length - max_len) / max_len)
-
-    async def evaluate_response_async(
-        self,
-        source_code: str,
-        assembly_code: str,
-        explanation: str,
-        test_case: dict,
-        audience: AudienceLevel,
-        explanation_type: ExplanationType,
-        token_count: int = 0,
-        response_time_ms: int | None = None,
-    ) -> EvaluationMetrics:
-        """Evaluate a response using Claude asynchronously."""
-
-        evaluation_prompt = self._build_evaluation_prompt(
-            source_code, assembly_code, explanation, test_case, audience, explanation_type
-        )
-
-        # Call Claude for evaluation asynchronously
-        message = await self.async_client.messages.create(
-            model=self.reviewer_model,
-            max_tokens=2000,
-            temperature=0.2,  # Lower temperature for more consistent evaluation
-            system="You are a meticulous technical reviewer with expertise in compilers and education.",
-            messages=[{"role": "user", "content": evaluation_prompt}],
-        )
-
-        # Parse the JSON response
-        response_text = message.content[0].text
-
-        # Extract JSON from the response (handle thinking output if present)
-        json_start = response_text.find("{")
-        json_end = response_text.rfind("}") + 1
-
-        if json_start == -1 or json_end == 0:
-            raise ValueError(f"No valid JSON found in Claude's response: {response_text[:200]}...")
-
-        json_str = response_text[json_start:json_end]
-
-        try:
-            evaluation = json.loads(json_str)
-        except json.JSONDecodeError as e:
-            raise ValueError(
-                f"Failed to parse Claude's evaluation response as JSON: {e}\nResponse: {json_str[:200]}..."
-            ) from e
-
-        # Convert Claude's 0-100 scores to 0-1 range
-        if "scores" not in evaluation:
-            raise ValueError(f"Missing 'scores' in evaluation response: {list(evaluation.keys())}")
-
-        scores = evaluation["scores"]
-
-        # Validate required score fields
-        required_scores = ["accuracy", "relevance", "conciseness", "insight", "appropriateness"]
-        missing_scores = [field for field in required_scores if field not in scores]
-        if missing_scores:
-            raise ValueError(f"Missing required scores: {missing_scores}")
-
-        # Calculate overall score with new weights
-        weights = {"accuracy": 0.3, "relevance": 0.25, "conciseness": 0.2, "insight": 0.15, "appropriateness": 0.1}
-        overall_score = sum(scores[metric] * weights[metric] for metric in weights) / 100
-
-        # Map to EvaluationMetrics format
-        return EvaluationMetrics(
-            accuracy=scores["accuracy"] / 100,
-            relevance=scores["relevance"] / 100,
-            conciseness=scores["conciseness"] / 100,
-            insight=scores["insight"] / 100,
-            appropriateness=scores["appropriateness"] / 100,
-            overall_score=overall_score,
-            token_count=token_count,
-            response_time_ms=response_time_ms,
-            flags=evaluation.get("flags"),
-            strengths=evaluation.get("strengths"),
-            notes=evaluation.get("overall_assessment"),
-        )
diff --git a/prompt_testing/evaluation/prompt_advisor.py b/prompt_testing/evaluation/prompt_advisor.py
deleted file mode 100644
index d240b08..0000000
--- a/prompt_testing/evaluation/prompt_advisor.py
+++ /dev/null
@@ -1,832 +0,0 @@
-"""
-Prompt improvement advisor using Claude to analyze results and suggest improvements.
-"""
-
-import json
-from pathlib import Path
-from typing import Any
-
-from anthropic import Anthropic
-
-from prompt_testing.evaluation.reviewer import HumanReview, ReviewManager
-from prompt_testing.yaml_utils import create_yaml_dumper, load_yaml_file
-
-
-class PromptAdvisor:
-    """Uses Claude to analyze test results and suggest prompt improvements."""
-
-    def __init__(
-        self,
-        anthropic_api_key: str | None = None,
-        advisor_model: str = "claude-sonnet-4-0",
-    ):
-        self.client = Anthropic(api_key=anthropic_api_key) if anthropic_api_key else Anthropic()
-        self.advisor_model = advisor_model
-
-    def load_human_reviews_for_prompt(self, results_dir: str | Path, prompt_version: str) -> dict[str, HumanReview]:
-        """Load human reviews for a specific prompt, indexed by case_id."""
-        review_manager = ReviewManager(results_dir)
-        reviews = review_manager.load_reviews(prompt_version=prompt_version)
-        return {review.case_id: review for review in reviews}
-
-    def _extract_human_insights(self, human_reviews: dict[str, HumanReview]) -> str:
-        """Extract key patterns and insights from human reviews."""
-        if not human_reviews:
-            return "No human reviews available."
-
-        # Aggregate scores (1-5 scale)
-        avg_scores = {
-            "accuracy": sum(r.accuracy for r in human_reviews.values()) / len(human_reviews),
-            "relevance": sum(r.relevance for r in human_reviews.values()) / len(human_reviews),
-            "conciseness": sum(r.conciseness for r in human_reviews.values()) / len(human_reviews),
-            "insight": sum(r.insight for r in human_reviews.values()) / len(human_reviews),
-            "appropriateness": sum(r.appropriateness for r in human_reviews.values()) / len(human_reviews),
-        }
-
-        # Collect all qualitative feedback
-        all_weaknesses = []
-        all_suggestions = []
-        all_strengths = []
-
-        for review in human_reviews.values():
-            all_weaknesses.extend(review.weaknesses)
-            all_suggestions.extend(review.suggestions)
-            all_strengths.extend(review.strengths)
-
-        # Find recurring patterns
-        weakness_counts = {}
-        for weakness in all_weaknesses:
-            weakness_counts[weakness] = weakness_counts.get(weakness, 0) + 1
-
-        common_weaknesses = [w for w, count in weakness_counts.items() if count > 1]
-        unique_suggestions = list(set(all_suggestions))
-
-        scores_text = (
-            f"Accuracy {avg_scores['accuracy']:.1f}, Relevance {avg_scores['relevance']:.1f}, "
-            f"Conciseness {avg_scores['conciseness']:.1f}, Insight {avg_scores['insight']:.1f}, "
-            f"Appropriateness {avg_scores['appropriateness']:.1f}"
-        )
-
-        return f"""Human Review Summary ({len(human_reviews)} reviews):
-- Average Scores (1-5): {scores_text}
-- Recurring Issues: {", ".join(common_weaknesses) if common_weaknesses else "No recurring patterns detected"}
-- Key Suggestions: {", ".join(unique_suggestions) if unique_suggestions else "No specific suggestions provided"}
-- Noted Strengths: {", ".join(set(all_strengths)) if all_strengths else "No strengths specifically noted"}
-
-Low-scoring areas that need attention:
-{self._identify_problem_areas(avg_scores)}"""
-
-    def _identify_problem_areas(self, avg_scores: dict[str, float]) -> str:
-        """Identify areas scoring below 3.5 that need improvement."""
-        problem_areas = []
-        for area, score in avg_scores.items():
-            if score < 3.5:
-                problem_areas.append(f"- {area.replace('_', ' ').title()}: {score:.1f}/5 (needs improvement)")
-
-        return "\n".join(problem_areas) if problem_areas else "All areas scoring reasonably well (≥3.5/5)"
-
-    def analyze_with_human_feedback(
-        self,
-        current_prompt: dict[str, str],
-        test_results: list[dict[str, Any]],
-        human_reviews: dict[str, HumanReview],
-        focus_areas: list[str] | None = None,
-    ) -> dict[str, Any]:
-        """Enhanced analysis that incorporates human feedback alongside automated metrics."""
-
-        # Extract human insights
-        human_insights = self._extract_human_insights(human_reviews)
-
-        # Get automated analysis components (existing logic)
-        all_missing_topics = []
-        all_incorrect_claims = []
-        all_notes = []
-        score_distribution = []
-
-        for result in test_results:
-            if result.get("success") and result.get("metrics"):
-                metrics = result["metrics"]
-                score_distribution.append(metrics.get("overall_score", 0))
-
-                if metrics.get("missing_topics"):
-                    all_missing_topics.extend(metrics["missing_topics"])
-                if metrics.get("incorrect_claims"):
-                    all_incorrect_claims.extend(metrics["incorrect_claims"])
-                if metrics.get("notes"):
-                    all_notes.append(metrics["notes"])
-
-        # Build enhanced analysis prompt
-        analysis_prompt = self._build_analysis_prompt_with_human_feedback(
-            current_prompt,
-            human_insights,
-            all_missing_topics,
-            all_incorrect_claims,
-            all_notes,
-            score_distribution,
-            len(test_results),
-            len(human_reviews),
-            focus_areas,
-        )
-
-        # Get Claude's advice
-        message = self.client.messages.create(
-            model=self.advisor_model,
-            max_tokens=4000,
-            temperature=0.3,
-            system=(
-                "You are an expert in prompt engineering and technical documentation, "
-                "helping improve prompts for compiler explanation generation. You have both "
-                "automated metrics and human expert feedback to guide your suggestions."
-            ),
-            messages=[{"role": "user", "content": analysis_prompt}],
-        )
-
-        try:
-            # Parse Claude's response
-            response_text = message.content[0].text
-
-            # Extract JSON from response
-            json_start = response_text.find("```json")
-            json_end = response_text.find("```", json_start + 7)
-
-            if json_start != -1 and json_end != -1:
-                json_str = response_text[json_start + 7 : json_end].strip()
-                suggestions = json.loads(json_str)
-            else:
-                # Fallback: try to parse entire response as JSON
-                suggestions = json.loads(response_text)
-
-            return {
-                "current_prompt": current_prompt,
-                "human_feedback_summary": human_insights,
-                "analysis_summary": {
-                    "total_cases": len(test_results),
-                    "human_reviewed_cases": len(human_reviews),
-                    "human_coverage": f"{len(human_reviews)}/{len(test_results)} "
-                    f"({100 * len(human_reviews) / len(test_results):.0f}%)"
-                    if len(test_results) > 0
-                    else "0/0 (0%)",
-                    "average_score": sum(score_distribution) / len(score_distribution) if score_distribution else 0,
-                    "common_missing_topics": list(set(all_missing_topics)),
-                    "common_incorrect_claims": list(set(all_incorrect_claims)),
-                },
-                "suggestions": suggestions,
-            }
-
-        except (json.JSONDecodeError, KeyError, IndexError) as e:
-            return {
-                "error": f"Failed to parse Claude's analysis: {e}",
-                "raw_response": message.content[0].text,
-                "current_prompt": current_prompt,
-                "human_feedback_summary": human_insights,
-            }
-
-    def analyze_results_and_suggest_improvements(
-        self,
-        current_prompt: dict[str, str],
-        test_results: list[dict[str, Any]],
-        focus_areas: list[str] | None = None,
-    ) -> dict[str, Any]:
-        """Analyze test results and suggest prompt improvements."""
-
-        # Aggregate feedback from results
-        all_missing_topics = []
-        all_incorrect_claims = []
-        all_notes = []
-        score_distribution = []
-
-        for result in test_results:
-            if result.get("success") and result.get("metrics"):
-                metrics = result["metrics"]
-                score_distribution.append(metrics.get("overall_score", 0))
-
-                if metrics.get("missing_topics"):
-                    all_missing_topics.extend(metrics["missing_topics"])
-                if metrics.get("incorrect_claims"):
-                    all_incorrect_claims.extend(metrics["incorrect_claims"])
-                if metrics.get("notes"):
-                    all_notes.append(metrics["notes"])
-
-        # Build analysis prompt
-        analysis_prompt = self._build_analysis_prompt(
-            current_prompt, all_missing_topics, all_incorrect_claims, all_notes, score_distribution, focus_areas
-        )
-
-        # Get Claude's advice
-        message = self.client.messages.create(
-            model=self.advisor_model,
-            max_tokens=4000,
-            temperature=0.3,
-            system=(
-                "You are an expert in prompt engineering and technical documentation, "
-                "helping improve prompts for compiler explanation generation."
-            ),
-            messages=[{"role": "user", "content": analysis_prompt}],
-        )
-
-        # Parse response
-        response_text = message.content[0].text
-
-        # Extract JSON from Claude's response
-        json_start = response_text.find("{")
-        json_end = response_text.rfind("}") + 1
-        if json_start >= 0 and json_end > json_start:
-            suggestions = json.loads(response_text[json_start:json_end])
-        else:
-            # If no JSON found, raise an error
-            raise ValueError(f"Claude did not provide JSON response. Got: {response_text[:200]}...")
-
-        return {
-            "current_prompt": current_prompt,
-            "analysis_summary": {
-                "total_cases": len(test_results),
-                "average_score": sum(score_distribution) / len(score_distribution) if score_distribution else 0,
-                "common_missing_topics": self._get_most_common(all_missing_topics, 5),
-                "common_incorrect_claims": self._get_most_common(all_incorrect_claims, 3),
-            },
-            "suggestions": suggestions,
-            "model_used": self.advisor_model,
-        }
-
-    def _build_analysis_prompt(
-        self,
-        current_prompt: dict[str, str],
-        missing_topics: list[str],
-        incorrect_claims: list[str],
-        notes: list[str],
-        scores: list[float],
-        focus_areas: list[str] | None,
-    ) -> str:
-        """Build the prompt for Claude to analyze results."""
-
-        prompt = f"""I need help improving prompts for an AI system that explains compiler output.
-Please analyze the test results and suggest specific improvements.
-
-## Current Prompts
-
-System Prompt:
-```
-{current_prompt.get("system_prompt", "Not provided")}
-```
-
-User Prompt:
-```
-{current_prompt.get("user_prompt", "Not provided")}
-```
-
-Assistant Prefill:
-```
-{current_prompt.get("assistant_prefill", "Not provided")}
-```
-
-## Test Results Analysis
-
-Average Score: {sum(scores) / len(scores) if scores else 0:.2f}/1.0
-Score Distribution: Min={min(scores) if scores else 0:.2f}, Max={max(scores) if scores else 0:.2f}
-
-### Common Issues Found
-
-Missing Topics (frequency):
-{self._format_frequency_list(missing_topics)}
-
-Incorrect Claims Made:
-{self._format_list(incorrect_claims)}
-
-Reviewer Notes Summary:
-{self._format_list(notes[:5])}  # Show first 5
-
-"""
-
-        if focus_areas:
-            prompt += f"""
-### Focus Areas for Improvement
-{self._format_list(focus_areas)}
-
-"""
-
-        prompt += """
-## Task
-
-Based on this analysis, please suggest specific improvements to the prompts. Focus on:
-
-1. **Addressing Missing Topics**: How can we modify the prompts to ensure these topics are covered?
-2. **Preventing Incorrect Claims**: What guardrails or instructions would help?
-3. **Improving Weak Areas**: Based on the reviewer notes, what changes would help?
-4. **Concrete Examples**: Provide specific wording changes, not just general advice.
-5. **Targeted Improvements**: Consider whether each suggestion should apply to:
-   - Specific audiences (beginner/intermediate/expert)
-   - Specific explanation types (assembly/source/optimization)
-   - General guidelines (applies to all cases)
-
-IMPORTANT: Avoid prescriptive language that creates checklist-style behavior. Instead of:
-- "Always include X" → "When X would help understanding, explain it"
-- "Explicitly state Y" → "When Y is relevant to the explanation, mention it"
-- "Include Z as a standard part" → "Weave Z into the explanation where it adds value"
-
-The goal is natural, contextual explanations, not formulaic outputs with mandatory sections.
-
-Provide your response in this JSON format:
-```json
-{
-    "priority_improvements": [
-        {
-            "issue": "Description of the issue",
-            "current_text": "The problematic part of the current prompt",
-            "suggested_text": "The improved version",
-            "rationale": "Why this change will help"
-        }
-    ],
-    "system_prompt_changes": {
-        "additions": ["New instructions to add"],
-        "modifications": ["Parts to modify and how"],
-        "removals": ["Parts to remove"]
-    },
-    "user_prompt_changes": {
-        "additions": ["New elements to add"],
-        "modifications": ["Parts to modify and how"],
-        "removals": ["Parts to remove"]
-    },
-    "general_recommendations": [
-        "Higher-level suggestions for the overall approach"
-    ],
-    "expected_impact": "Summary of how these changes should improve performance"
-}
-```
-"""
-        return prompt
-
-    def _build_analysis_prompt_with_human_feedback(
-        self,
-        current_prompt: dict[str, str],
-        human_insights: str,
-        missing_topics: list[str],
-        incorrect_claims: list[str],
-        notes: list[str],
-        scores: list[float],
-        total_cases: int,
-        human_reviewed_cases: int,
-        focus_areas: list[str] | None,
-    ) -> str:
-        """Build enhanced analysis prompt that incorporates human feedback."""
-
-        coverage_pct = (human_reviewed_cases / total_cases * 100) if total_cases > 0 else 0
-
-        prompt = f"""I need help improving prompts for an AI system that explains compiler output.
-Please analyze the test results using BOTH automated metrics and human expert feedback.
-
-## Test Results Overview
-
-Total test cases: {total_cases}
-Human review coverage: {human_reviewed_cases}/{total_cases} ({coverage_pct:.0f}%)
-Automated average score: {sum(scores) / len(scores) if scores else 0:.2f}/1.0
-
-## Current Prompts
-
-System Prompt:
-```
-{current_prompt.get("system_prompt", "Not provided")}
-```
-
-User Prompt:
-```
-{current_prompt.get("user_prompt", "Not provided")}
-```
-
-Assistant Prefill:
-```
-{current_prompt.get("assistant_prefill", "Not provided")}
-```
-
-## Human Expert Feedback (Priority Insights)
-
-{human_insights}
-
-## Automated Analysis Summary
-
-Score Distribution: Min={min(scores) if scores else 0:.2f}, Max={max(scores) if scores else 0:.2f}
-
-Missing Topics (automated detection):
-{self._format_frequency_list(missing_topics)}
-
-Incorrect Claims Made (automated detection):
-{self._format_list(incorrect_claims)}
-
-Sample Reviewer Notes:
-{self._format_list(notes[:3])}  # Show first 3
-
-## Improvement Priority Framework
-
-1. **CRITICAL**: Issues flagged by human review (educational value, clarity, appropriateness)
-2. **HIGH**: Patterns confirmed by both human and automated feedback
-3. **MEDIUM**: Automated flags where human reviews don't contradict
-4. **LOW**: Automated-only issues where humans rated positively
-
-## Analysis Instructions
-
-When humans and automated systems disagree, prioritize human judgment on:
-- Educational value and pedagogical clarity
-- Audience appropriateness
-- Real-world applicability
-- Engagement and interest level
-
-Trust automated systems for:
-- Technical accuracy detection
-- Consistency checking
-- Pattern recognition across many cases
-
-Focus improvement suggestions on addressing human-identified issues first,
-then automated issues that don't conflict with human feedback.
-
-"""
-
-        if focus_areas:
-            prompt += f"""
-### Specific Focus Areas Requested
-{self._format_list(focus_areas)}
-"""
-
-        prompt += """
-## Response Format
-
-Provide your analysis in this JSON format:
-
-```json
-{
-    "priority_improvements": [
-        {
-            "issue": "Issue description (specify if human-flagged, automated, or both)",
-            "current_text": "The problematic part of the current prompt",
-            "suggested_text": "The improved version",
-            "rationale": "Why this change will help (reference human/automated feedback)",
-            "priority": "critical|high|medium|low"
-        }
-    ],
-    "system_prompt_changes": {
-        "additions": ["New instructions to add"],
-        "modifications": ["Parts to modify and how"],
-        "removals": ["Parts to remove"]
-    },
-    "user_prompt_changes": {
-        "additions": ["New elements to add"],
-        "modifications": ["Parts to modify and how"],
-        "removals": ["Parts to remove"]
-    },
-    "human_feedback_integration": {
-        "human_priorities_addressed": ["Which human concerns are being addressed"],
-        "automated_retained": ["Which automated findings remain relevant"],
-        "conflicts_resolved": ["How human/automated disagreements were resolved"]
-    },
-    "expected_impact": "Summary of how these changes should improve both human satisfaction and automated metrics"
-}
-```
-"""
-        return prompt
-
-    def _get_most_common(self, items: list[str], n: int) -> list[tuple[str, int]]:
-        """Get the n most common items with counts."""
-        from collections import Counter
-
-        counter = Counter(items)
-        return counter.most_common(n)
-
-    def _format_frequency_list(self, items: list[str]) -> str:
-        """Format a list with frequency counts."""
-        if not items:
-            return "- None identified"
-
-        from collections import Counter
-
-        counter = Counter(items)
-        lines = []
-        for item, count in counter.most_common(10):
-            lines.append(f"- {item} ({count} occurrences)")
-        return "\n".join(lines)
-
-    def _format_list(self, items: list[str]) -> str:
-        """Format a simple list."""
-        if not items:
-            return "- None"
-        return "\n".join(f"- {item}" for item in items[:10])  # Limit to 10
-
-    def suggest_prompt_experiment(
-        self,
-        current_prompt: dict[str, str],
-        improvement_suggestions: dict[str, Any],
-        experiment_name: str,
-    ) -> dict[str, str]:
-        """Generate an experimental prompt version based on suggestions."""
-
-        # Start with current prompt
-        new_prompt = current_prompt.copy()
-
-        suggestions = improvement_suggestions.get("suggestions", {})
-
-        # Apply priority improvements with smart targeting
-        if "priority_improvements" in suggestions:
-            improvements = suggestions["priority_improvements"]
-
-            # Apply each improvement to the appropriate section
-            for imp in improvements:
-                if "current_text" in imp and "suggested_text" in imp:
-                    self._apply_targeted_improvement(new_prompt, imp)
-
-        # Apply system prompt changes with smart targeting
-        if "system_prompt_changes" in suggestions:
-            changes = suggestions["system_prompt_changes"]
-            if "additions" in changes:
-                self._apply_targeted_additions(new_prompt, changes["additions"])
-
-        # Apply user prompt modifications
-        if "user_prompt_changes" in suggestions:
-            changes = suggestions["user_prompt_changes"]
-            if "modifications" in changes:
-                for _mod in changes["modifications"]:
-                    # Simple implementation - in practice might need smarter parsing
-                    if "Explain the {arch} assembly output" in new_prompt.get("user_prompt", ""):
-                        new_prompt["user_prompt"] = (
-                            "Provide a systematic analysis of the {arch} assembly output, "
-                            "covering optimizations applied, source-to-assembly mapping, "
-                            "and performance implications."
-                        )
-
-        # Update assistant prefill if recommended
-        if "assistant prefill" in str(suggestions.get("general_recommendations", [])):
-            new_prompt["assistant_prefill"] = (
-                "I have analyzed the assembly code systematically, "
-                "examining optimizations, mappings, and performance implications:"
-            )
-
-        # Add experiment metadata
-        new_prompt["experiment_metadata"] = {
-            "base_prompt": current_prompt.get("name", "unknown"),
-            "experiment_name": experiment_name,
-            "improvements_applied": len(suggestions.get("priority_improvements", [])),
-            "expected_impact": suggestions.get("expected_impact", "Not specified"),
-        }
-
-        return new_prompt
-
-    def _classify_suggestion_target(self, suggestion_text: str) -> dict[str, list[str]]:
-        """Classify where a suggestion should be applied based on its content."""
-        targets = {"audiences": [], "explanation_types": [], "general": False}
-
-        suggestion_lower = suggestion_text.lower()
-
-        # Calling convention - primarily for beginners and assembly explanations
-        if any(term in suggestion_lower for term in ["calling convention", "parameter passing", "register roles"]):
-            targets["audiences"].append("beginner")
-            targets["explanation_types"].append("assembly")
-
-        # Optimization analysis - for optimization explanations and intermediate+ audiences
-        if any(term in suggestion_lower for term in ["optimization", "compare", "level", "-o0", "-o1", "-o2", "-o3"]):
-            targets["explanation_types"].append("optimization")
-            targets["audiences"].extend(["intermediate", "expert"])
-
-        # Performance implications - for intermediate+ audiences
-        if any(term in suggestion_lower for term in ["performance", "practical", "developer", "compiler-friendly"]):
-            targets["audiences"].extend(["intermediate", "expert"])
-
-        # Source mapping - for source explanations
-        if any(term in suggestion_lower for term in ["source", "mapping", "construct", "high-level"]):
-            targets["explanation_types"].append("source")
-
-        # Technical accuracy - applies to general guidelines
-        if any(
-            term in suggestion_lower for term in ["instruction", "operand", "verify", "trace", "accurate", "precise"]
-        ):
-            targets["general"] = True
-
-        # Expert-level content
-        if any(
-            term in suggestion_lower for term in ["microarchitecture", "pipeline", "advanced", "comparative analysis"]
-        ):
-            targets["audiences"].append("expert")
-
-        # Remove duplicates
-        targets["audiences"] = list(set(targets["audiences"]))
-        targets["explanation_types"] = list(set(targets["explanation_types"]))
-
-        return targets
-
-    def _apply_targeted_improvement(self, new_prompt: dict[str, Any], improvement: dict[str, str]) -> None:
-        """Apply an improvement to the appropriate section of the prompt."""
-        current_text = improvement["current_text"]
-        suggested_text = improvement["suggested_text"]
-
-        # Classify the suggestion
-        targets = self._classify_suggestion_target(improvement.get("issue", "") + " " + suggested_text)
-
-        # Apply to system prompt if it's a general improvement or direct system prompt change
-        if (
-            targets["general"] or current_text in new_prompt.get("system_prompt", "")
-        ) and "system_prompt" in new_prompt:
-            new_prompt["system_prompt"] = new_prompt["system_prompt"].replace(current_text, suggested_text)
-
-        # Apply to specific audience levels (check both base and explanation-specific locations)
-        # TODO: In the future, we may need to create new explanation-specific audience overrides
-        if targets["audiences"]:
-            for audience in targets["audiences"]:
-                # Check base audience level
-                if "audience_levels" in new_prompt and audience in new_prompt["audience_levels"]:
-                    guidance = new_prompt["audience_levels"][audience].get("guidance", "")
-                    if current_text in guidance:
-                        new_prompt["audience_levels"][audience]["guidance"] = guidance.replace(
-                            current_text, suggested_text
-                        )
-
-                # Check explanation-specific audience overrides
-                if "explanation_types" in new_prompt:
-                    for exp_config in new_prompt["explanation_types"].values():
-                        if (
-                            isinstance(exp_config, dict)
-                            and "audience_levels" in exp_config
-                            and audience in exp_config["audience_levels"]
-                        ):
-                            guidance = exp_config["audience_levels"][audience].get("guidance", "")
-                            if current_text in guidance:
-                                exp_config["audience_levels"][audience]["guidance"] = guidance.replace(
-                                    current_text, suggested_text
-                                )
-
-        # Apply to specific explanation types
-        if targets["explanation_types"] and "explanation_types" in new_prompt:
-            for exp_type in targets["explanation_types"]:
-                if exp_type in new_prompt["explanation_types"]:
-                    focus = new_prompt["explanation_types"][exp_type].get("focus", "")
-                    if current_text in focus:
-                        new_prompt["explanation_types"][exp_type]["focus"] = focus.replace(current_text, suggested_text)
-
-    def _apply_targeted_additions(self, new_prompt: dict[str, Any], additions: list[str]) -> None:
-        """Apply additions to the appropriate sections based on their content."""
-        general_additions = []
-
-        for addition in additions:
-            targets = self._classify_suggestion_target(addition)
-            applied = False
-
-            # Apply to specific audience levels (check both base and explanation-specific locations)
-            if targets["audiences"]:
-                for audience in targets["audiences"]:
-                    # Check base audience level
-                    if "audience_levels" in new_prompt and audience in new_prompt["audience_levels"]:
-                        current_guidance = new_prompt["audience_levels"][audience].get("guidance", "")
-                        new_prompt["audience_levels"][audience]["guidance"] = (
-                            current_guidance.rstrip() + f"\n{addition}\n"
-                        )
-                        applied = True
-
-                    # Check explanation-specific audience overrides
-                    if "explanation_types" in new_prompt:
-                        for exp_config in new_prompt["explanation_types"].values():
-                            if (
-                                isinstance(exp_config, dict)
-                                and "audience_levels" in exp_config
-                                and audience in exp_config["audience_levels"]
-                            ):
-                                current_guidance = exp_config["audience_levels"][audience].get("guidance", "")
-                                exp_config["audience_levels"][audience]["guidance"] = (
-                                    current_guidance.rstrip() + f"\n{addition}\n"
-                                )
-                                applied = True
-
-            # Apply to specific explanation types
-            if targets["explanation_types"] and "explanation_types" in new_prompt:
-                for exp_type in targets["explanation_types"]:
-                    if exp_type in new_prompt["explanation_types"]:
-                        current_focus = new_prompt["explanation_types"][exp_type].get("focus", "")
-                        new_prompt["explanation_types"][exp_type]["focus"] = current_focus.rstrip() + f"\n{addition}\n"
-                        applied = True
-
-            # If not applied to specific sections or is general, add to general additions
-            if not applied or targets["general"]:
-                general_additions.append(addition)
-
-        # Apply remaining general additions to system prompt
-        if general_additions and "system_prompt" in new_prompt:
-            additions_text = "\n\n# Additional guidance from analysis:\n"
-            for addition in general_additions:
-                additions_text += f"- {addition}\n"
-            new_prompt["system_prompt"] += additions_text
-
-
-class PromptOptimizer:
-    """Orchestrates the prompt optimization workflow."""
-
-    def __init__(self, project_root: str, anthropic_api_key: str | None = None):
-        self.project_root = Path(project_root)
-        self.advisor = PromptAdvisor(anthropic_api_key)
-        self.results_dir = self.project_root / "prompt_testing" / "results"
-        self.prompts_dir = self.project_root / "prompt_testing" / "prompts"
-
-    def analyze_and_improve(
-        self,
-        results_file: str,
-        prompt_version: str,
-        output_name: str | None = None,
-    ) -> Path:
-        """Analyze results and create improved prompt version."""
-
-        # Load test results
-        results_path = self.results_dir / results_file
-        with results_path.open() as f:
-            test_data = json.load(f)
-
-        # Load current prompt - handle "current" special case
-        if prompt_version == "current":
-            prompt_path = self.project_root / "app" / "prompt.yaml"
-        else:
-            prompt_path = self.prompts_dir / f"{prompt_version}.yaml"
-        current_prompt = load_yaml_file(prompt_path)
-
-        # Get improvement suggestions
-        suggestions = self.advisor.analyze_results_and_suggest_improvements(
-            current_prompt, test_data.get("results", [])
-        )
-
-        # Save analysis
-        analysis_file = self.results_dir / f"analysis_{prompt_version}_{results_file}"
-        with analysis_file.open("w") as f:
-            json.dump(suggestions, f, indent=2)
-
-        print(f"Analysis saved to: {analysis_file}")
-
-        # Create experimental prompt if requested
-        if output_name:
-            new_prompt = self.advisor.suggest_prompt_experiment(
-                current_prompt, suggestions, f"Automated improvement based on {results_file}"
-            )
-
-            new_prompt_path = self.prompts_dir / f"{output_name}.yaml"
-            yaml_out = create_yaml_dumper()
-            with new_prompt_path.open("w") as f:
-                yaml_out.dump(new_prompt, f)
-
-            print(f"Experimental prompt saved to: {new_prompt_path}")
-            return new_prompt_path
-
-        return analysis_file
-
-    def analyze_and_improve_with_human_feedback(
-        self,
-        results_file: str,
-        prompt_version: str,
-        output_name: str | None = None,
-    ) -> tuple[Path, dict[str, int]]:
-        """Enhanced analysis that automatically incorporates human reviews if available."""
-
-        # Load test results
-        results_path = self.results_dir / results_file
-        with results_path.open() as f:
-            test_data = json.load(f)
-
-        # Load current prompt - handle "current" special case
-        if prompt_version == "current":
-            prompt_path = self.project_root / "app" / "prompt.yaml"
-        else:
-            prompt_path = self.prompts_dir / f"{prompt_version}.yaml"
-        current_prompt = load_yaml_file(prompt_path)
-
-        # Load human reviews for this prompt
-        human_reviews = self.advisor.load_human_reviews_for_prompt(self.results_dir, prompt_version)
-
-        # Determine which analysis method to use
-        if human_reviews:
-            suggestions = self.advisor.analyze_with_human_feedback(
-                current_prompt, test_data.get("results", []), human_reviews
-            )
-            analysis_suffix = "human_enhanced"
-        else:
-            suggestions = self.advisor.analyze_results_and_suggest_improvements(
-                current_prompt, test_data.get("results", [])
-            )
-            analysis_suffix = "automated_only"
-
-        # Save analysis with descriptive filename
-        analysis_file = self.results_dir / f"analysis_{prompt_version}_{analysis_suffix}_{results_file}"
-        with analysis_file.open("w") as f:
-            json.dump(suggestions, f, indent=2)
-
-        # Create human review stats
-        total_test_cases = len(test_data.get("results", []))
-        human_stats = {
-            "total_reviews": len(human_reviews),
-            "coverage": len(human_reviews) / total_test_cases if total_test_cases > 0 else 0,
-        }
-
-        # Create experimental prompt if requested
-        if output_name:
-            new_prompt = self.advisor.suggest_prompt_experiment(
-                current_prompt,
-                suggestions,
-                f"Improvement based on {results_file} ({'with human feedback' if human_reviews else 'automated only'})",
-            )
-
-            new_prompt_path = self.prompts_dir / f"{output_name}.yaml"
-            yaml_out = create_yaml_dumper()
-            with new_prompt_path.open("w") as f:
-                yaml_out.dump(new_prompt, f)
-
-            print(f"Experimental prompt saved to: {new_prompt_path}")
-            return new_prompt_path, human_stats
-
-        return analysis_file, human_stats
diff --git a/prompt_testing/evaluation/review_templates.yaml b/prompt_testing/evaluation/review_templates.yaml
deleted file mode 100644
index c89b688..0000000
--- a/prompt_testing/evaluation/review_templates.yaml
+++ /dev/null
@@ -1,125 +0,0 @@
-# Review Templates for Claude-based Evaluation
-# These templates define how Claude evaluates prompt responses
-
-default_review:
-  system_prompt: |
-    You are an expert compiler engineer and technical educator reviewing explanations
-    of compiler output. Evaluate responses for technical accuracy, educational value,
-    and clarity. Be rigorous but fair in your assessment.
-
-  evaluation_dimensions:
-    technical_accuracy:
-      weight: 0.30
-      description: |
-        Evaluate the technical accuracy of the explanation:
-        - Are assembly instructions correctly explained?
-        - Are compiler optimizations accurately described?
-        - Are technical claims verifiable and correct?
-        - Does it avoid oversimplifications that lead to inaccuracy?
-        - Are register names, instruction mnemonics, and calling conventions correct?
-
-    educational_value:
-      weight: 0.25
-      description: |
-        Assess the educational value for someone learning about compilers:
-        - Is the explanation at an appropriate level for the target audience?
-        - Does it build understanding progressively?
-        - Are complex concepts explained clearly?
-        - Does it provide insight into why the compiler made certain choices?
-        - Would a reader gain actionable knowledge?
-
-    clarity_structure:
-      weight: 0.20
-      description: |
-        Evaluate clarity and structure:
-        - Is the explanation well-organized and easy to follow?
-        - Are technical terms properly introduced before use?
-        - Is the language clear and concise?
-        - Does it avoid unnecessary jargon while maintaining precision?
-        - Is there a logical flow from simple to complex concepts?
-
-    completeness:
-      weight: 0.15
-      description: |
-        Assess completeness relative to the input:
-        - Does it address all significant transformations in the assembly?
-        - Are important optimizations explained?
-        - Does it cover the key differences between source and assembly?
-        - Is the scope appropriate (not too narrow or too broad)?
-        - Are edge cases or special behaviors noted where relevant?
-
-    practical_insights:
-      weight: 0.10
-      description: |
-        Evaluate practical insights provided:
-        - Does it help developers understand performance implications?
-        - Are there actionable insights about writing better code?
-        - Does it explain when/why certain optimizations occur?
-        - Does it connect assembly behavior to source code patterns?
-        - Are real-world implications discussed?
-
-# Specialized review template for optimization-focused cases
-optimization_review:
-  system_prompt: |
-    You are a compiler optimization expert reviewing explanations of optimized code.
-    Focus particularly on how well the explanation covers optimization techniques,
-    their triggers, and performance implications. You should think about processor specifics,
-    looking at the microarchitecture and instruction set architecture (ISA) level, using the
-    compiler flags and instruction set to intuit which architecture the code is targeting.
-
-  evaluation_dimensions:
-    optimization_accuracy:
-      weight: 0.35
-      description: |
-        Specifically evaluate optimization explanations:
-        - Are optimization techniques correctly identified and named?
-        - Is the optimization trigger condition explained?
-        - Are performance benefits quantified or qualified appropriately?
-        - Are trade-offs mentioned (code size, debuggability, etc.)?
-
-    technical_accuracy:
-      weight: 0.25
-      description: |
-        General technical accuracy of the explanation
-
-    educational_value:
-      weight: 0.20
-      description: |
-        How well does it teach optimization concepts?
-
-    practical_insights:
-      weight: 0.20
-      description: |
-        Actionable insights for writing optimization-friendly code
-
-# Template for beginner-focused evaluation
-beginner_review:
-  system_prompt: |
-    You are reviewing explanations intended for beginners learning about the assembly output of compilers.
-    Prioritize clarity and educational scaffolding over exhaustive technical detail. Do not assume the audience
-    knows much about assembly, compiler optimizations, or the process of compilation.
-
-  evaluation_dimensions:
-    clarity_structure:
-      weight: 0.35
-      description: |
-        Is this accessible to beginners?
-        - Avoids overwhelming jargon
-        - Explains prerequisites
-        - Uses analogies where helpful
-        - Progresses logically
-
-    educational_value:
-      weight: 0.30
-      description: |
-        Does it effectively teach concepts to beginners?
-
-    technical_accuracy:
-      weight: 0.20
-      description: |
-        Accurate while appropriately simplified
-
-    completeness:
-      weight: 0.15
-      description: |
-        Covers essentials without overwhelming detail
diff --git a/prompt_testing/evaluation/reviewer.py b/prompt_testing/evaluation/reviewer.py
deleted file mode 100644
index a4525dd..0000000
--- a/prompt_testing/evaluation/reviewer.py
+++ /dev/null
@@ -1,156 +0,0 @@
-"""
-Human review tools for prompt evaluation.
-"""
-
-import json
-from dataclasses import asdict, dataclass
-from datetime import datetime
-from pathlib import Path
-
-
-@dataclass
-class HumanReview:
-    """Human review of a prompt response."""
-
-    case_id: str
-    prompt_version: str
-    reviewer: str
-    timestamp: str
-
-    # Scores (1-5 scale)
-    accuracy: int  # Technical correctness without false claims
-    relevance: int  # Discusses actual code, recognizes optimization level
-    conciseness: int  # Direct explanation without filler or boilerplate
-    insight: int  # Explains WHY and provides actionable understanding
-    appropriateness: int  # Matches audience level and explanation type
-
-    # Qualitative feedback
-    strengths: list[str]
-    weaknesses: list[str]
-    suggestions: list[str]
-    overall_comments: str
-
-    # Comparison (when reviewing multiple versions)
-    compared_to: str | None = None
-    preference: str | None = None  # 'this', 'other', 'similar'
-    preference_reason: str | None = None
-
-
-class ReviewManager:
-    """Manages human reviews and creates comparison interfaces."""
-
-    def __init__(self, results_dir: str | Path):
-        self.results_dir = Path(results_dir)
-        self.reviews_file = self.results_dir / "human_reviews.jsonl"
-
-    def save_review(self, review: HumanReview) -> None:
-        """Save a human review to the reviews file."""
-        self.results_dir.mkdir(parents=True, exist_ok=True)
-
-        with self.reviews_file.open("a") as f:
-            f.write(json.dumps(asdict(review)) + "\n")
-
-    def load_reviews(self, case_id: str | None = None, prompt_version: str | None = None) -> list[HumanReview]:
-        """Load reviews, optionally filtered by case ID or prompt version."""
-        if not self.reviews_file.exists():
-            return []
-
-        reviews = []
-        with self.reviews_file.open() as f:
-            for line in f:
-                review_data = json.loads(line.strip())
-                review = HumanReview(**review_data)
-
-                if case_id and review.case_id != case_id:
-                    continue
-                if prompt_version and review.prompt_version != prompt_version:
-                    continue
-
-                reviews.append(review)
-
-        return reviews
-
-    def export_review_summary(self, output_file: str) -> None:
-        """Export a summary of all reviews to a JSON file."""
-        reviews = self.load_reviews()
-
-        summary = {"total_reviews": len(reviews), "by_prompt_version": {}, "by_case": {}, "average_scores": {}}
-
-        # Group by prompt version
-        for review in reviews:
-            version = review.prompt_version
-            if version not in summary["by_prompt_version"]:
-                summary["by_prompt_version"][version] = []
-            summary["by_prompt_version"][version].append(asdict(review))
-
-        # Group by case
-        for review in reviews:
-            case_id = review.case_id
-            if case_id not in summary["by_case"]:
-                summary["by_case"][case_id] = []
-            summary["by_case"][case_id].append(asdict(review))
-
-        # Calculate average scores
-        if reviews:
-            avg_accuracy = sum(r.accuracy for r in reviews) / len(reviews)
-            avg_relevance = sum(r.relevance for r in reviews) / len(reviews)
-            avg_conciseness = sum(r.conciseness for r in reviews) / len(reviews)
-            avg_insight = sum(r.insight for r in reviews) / len(reviews)
-            avg_appropriateness = sum(r.appropriateness for r in reviews) / len(reviews)
-
-            summary["average_scores"] = {
-                "accuracy": avg_accuracy,
-                "relevance": avg_relevance,
-                "conciseness": avg_conciseness,
-                "insight": avg_insight,
-                "appropriateness": avg_appropriateness,
-            }
-
-        output_path = Path(output_file)
-        with output_path.open("w") as f:
-            json.dump(summary, f, indent=2)
-
-
-def create_simple_review_cli(case_id: str, response: str, prompt_version: str) -> HumanReview:
-    """Create a simple CLI for reviewing a single response."""
-    print(f"\n=== REVIEWING CASE: {case_id} ===")
-    print(f"Prompt Version: {prompt_version}")
-    print(f"\nResponse:\n{response}")
-    print("\n" + "=" * 50)
-
-    reviewer = input("Reviewer name: ").strip()
-
-    print("\nPlease rate the following aspects (1-5 scale):")
-    accuracy = int(input("Accuracy (technical correctness without false claims): "))
-    relevance = int(input("Relevance (discusses actual code, recognizes optimization level): "))
-    conciseness = int(input("Conciseness (direct explanation without filler): "))
-    insight = int(input("Insight (explains WHY, provides actionable understanding): "))
-    appropriateness = int(input("Appropriateness (matches audience level and explanation type): "))
-
-    print("\nPlease provide qualitative feedback:")
-    strengths = input("Strengths (comma-separated): ").split(",")
-    strengths = [s.strip() for s in strengths if s.strip()]
-
-    weaknesses = input("Weaknesses (comma-separated): ").split(",")
-    weaknesses = [w.strip() for w in weaknesses if w.strip()]
-
-    suggestions = input("Suggestions (comma-separated): ").split(",")
-    suggestions = [s.strip() for s in suggestions if s.strip()]
-
-    overall_comments = input("Overall comments: ").strip()
-
-    return HumanReview(
-        case_id=case_id,
-        prompt_version=prompt_version,
-        reviewer=reviewer,
-        timestamp=datetime.now().isoformat(),
-        accuracy=accuracy,
-        relevance=relevance,
-        conciseness=conciseness,
-        insight=insight,
-        appropriateness=appropriateness,
-        strengths=strengths,
-        weaknesses=weaknesses,
-        suggestions=suggestions,
-        overall_comments=overall_comments,
-    )
diff --git a/prompt_testing/evaluation/scorer.py b/prompt_testing/evaluation/scorer.py
deleted file mode 100644
index e084bec..0000000
--- a/prompt_testing/evaluation/scorer.py
+++ /dev/null
@@ -1,132 +0,0 @@
-"""
-Metrics-based scoring system for prompt testing.
-"""
-
-from pathlib import Path
-from typing import Any, ClassVar
-
-from prompt_testing.yaml_utils import create_yaml_loader
-
-
-def load_test_case(file_path: str, case_id: str) -> dict[str, Any]:
-    """Load a specific test case from a YAML file."""
-    path = Path(file_path)
-    yaml = create_yaml_loader()
-    with path.open(encoding="utf-8") as f:
-        data = yaml.load(f)
-
-    for case in data["cases"]:
-        if case["id"] == case_id:
-            return case
-
-    raise ValueError(f"Test case {case_id} not found in {file_path}")
-
-
-def load_all_test_cases(test_cases_dir: str) -> list[dict[str, Any]]:
-    """Load all test cases from the test_cases directory."""
-    all_cases = []
-    test_dir = Path(test_cases_dir)
-    yaml = create_yaml_loader()
-
-    for file_path in test_dir.glob("*.yaml"):
-        with file_path.open(encoding="utf-8") as f:
-            data = yaml.load(f)
-            all_cases.extend(data["cases"])
-
-    return all_cases
-
-
-class MetricsScorer:
-    """New metrics-based scorer focusing on quality over topic coverage."""
-
-    METRICS: ClassVar[dict[str, dict[str, str]]] = {
-        "accuracy": {"name": "Accuracy", "description": "Technical correctness without false claims"},
-        "relevance": {"name": "Relevance", "description": "Discusses actual code, recognizes optimization level"},
-        "conciseness": {"name": "Conciseness", "description": "Direct explanation without filler or boilerplate"},
-        "insight": {"name": "Insight", "description": "Explains WHY and provides actionable understanding"},
-        "appropriateness": {"name": "Appropriateness", "description": "Matches audience level and explanation type"},
-    }
-
-    def calculate_automatic_score(self, explanation: str, test_case: dict[str, Any]) -> dict[str, Any]:
-        """
-        Basic heuristic scoring. Claude reviewer will provide the real scores.
-
-        This is mainly for quick feedback and catching obvious issues.
-        """
-        scores = {}
-        flags = []
-
-        # Accuracy heuristics - check for common misleading patterns
-        accuracy_score = 1.0
-        misleading_patterns = [
-            "likely leverages",  # Hedge words for made-up claims
-            "compile-time optimization converts",  # Vague technical-sounding nonsense
-            "might inline this function",  # Speculation without evidence
-        ]
-        for pattern in misleading_patterns:
-            if pattern.lower() in explanation.lower():
-                accuracy_score -= 0.3
-                flags.append(f"Potentially misleading: '{pattern}'")
-
-        scores["accuracy"] = max(0.0, accuracy_score)
-
-        # Relevance heuristics - check for false optimization claims
-        relevance_score = 1.0
-        if (
-            any(word in explanation.lower() for word in ["efficient", "optimized", "optimal"])
-            and "unoptimized" not in test_case.get("description", "").lower()
-        ):
-            # If test case doesn't mention being unoptimized, this might be wrong
-            relevance_score -= 0.2
-            flags.append("Claims efficiency - check if code is actually optimized")
-
-        scores["relevance"] = max(0.0, relevance_score)
-
-        # Conciseness heuristics - check for boilerplate headers
-        conciseness_score = 1.0
-        boilerplate_patterns = [
-            "architecture:",
-            "optimization level:",
-            "calling convention:",
-            "microarchitectural observations:",
-            "performance implications:",
-        ]
-        boilerplate_count = sum(1 for pattern in boilerplate_patterns if pattern.lower() in explanation.lower())
-        if boilerplate_count > 0:
-            conciseness_score -= boilerplate_count * 0.2
-            flags.append(f"Found {boilerplate_count} boilerplate headers")
-
-        scores["conciseness"] = max(0.0, conciseness_score)
-
-        # Insight and appropriateness - hard to measure automatically
-        # Default to neutral scores, let Claude reviewer handle these
-        scores["insight"] = 0.6
-        scores["appropriateness"] = 0.6
-
-        # Calculate overall score as weighted average
-        weights = {
-            "accuracy": 0.3,  # Critical - false info is bad
-            "relevance": 0.25,  # Very important - must match actual code
-            "conciseness": 0.2,  # Important - avoid fluff
-            "insight": 0.15,  # Nice to have - explains WHY
-            "appropriateness": 0.1,  # Important but harder to measure
-        }
-
-        overall_score = sum(scores[metric] * weights[metric] for metric in scores)
-
-        return {
-            "overall_score": overall_score,
-            "metric_scores": scores,
-            "flags": flags,
-            "scoring_method": "automatic_heuristics",
-        }
-
-    def get_metrics_definition(self) -> dict[str, Any]:
-        """Return the metrics definition for use by Claude reviewer."""
-        return self.METRICS
-
-
-def calculate_scores(explanation: str, test_case: dict[str, Any]) -> dict[str, Any]:
-    """Calculate scores for an explanation using the new metrics system."""
-    scorer = MetricsScorer()
-    return scorer.calculate_automatic_score(explanation, test_case)
diff --git a/prompt_testing/file_utils.py b/prompt_testing/file_utils.py
index be07432..4925b61 100644
--- a/prompt_testing/file_utils.py
+++ b/prompt_testing/file_utils.py
@@ -4,7 +4,7 @@
 from pathlib import Path
 from typing import Any
 
-from prompt_testing.yaml_utils import create_yaml_dumper, load_yaml_file
+from prompt_testing.yaml_utils import create_yaml_dumper, create_yaml_loader, load_yaml_file
 
 
 def ensure_directory(path: Path) -> Path:
@@ -123,3 +123,24 @@ def save_prompt_file(prompt_data: dict[str, Any], output_path: Path) -> None:
             yaml.dump(prompt_data, f)
     except OSError as e:
         raise RuntimeError(f"Failed to save prompt to {output_path}: {e}") from e
+
+
+def load_all_test_cases(test_cases_dir: str) -> list[dict[str, Any]]:
+    """Load all test cases from YAML files in a directory.
+
+    Args:
+        test_cases_dir: Path to directory containing test case YAML files
+
+    Returns:
+        List of test case dicts
+    """
+    all_cases = []
+    test_dir = Path(test_cases_dir)
+    yaml = create_yaml_loader()
+
+    for file_path in sorted(test_dir.glob("*.yaml")):
+        with file_path.open(encoding="utf-8") as f:
+            data = yaml.load(f)
+            all_cases.extend(data["cases"])
+
+    return all_cases
diff --git a/prompt_testing/prompts/haiku_4_5.yaml b/prompt_testing/prompts/haiku_4_5.yaml
deleted file mode 100644
index 085b64b..0000000
--- a/prompt_testing/prompts/haiku_4_5.yaml
+++ /dev/null
@@ -1,92 +0,0 @@
-name: Haiku 4.5 Evaluation
-description: Testing Claude Haiku 4.5 with production prompt (identical to current except model)
-model:
-  name: claude-haiku-4-5
-  max_tokens: 1024
-  temperature: 0.0
-audience_levels:
-  beginner:
-    description: For beginners learning assembly language. Uses simple language and explains technical terms.
-    guidance: |
-      - Include foundational concepts about assembly basics, register purposes, and memory organization. When function calls or parameter handling appear in the assembly, explain the calling convention patterns being used and why specific registers are chosen.
-      - Use simple, clear language. Define technical terms inline when first used (e.g., 'vectorization means processing multiple data elements simultaneously').
-      - Explain concepts step-by-step. Avoid overwhelming with too many details at once.
-      - Use analogies where helpful to explain complex concepts.
-      - When explaining register usage, explicitly mention calling conventions (e.g., 'By convention, register X is used for...').
-  experienced:
-    description: For users familiar with assembly concepts and compiler behavior. Focuses on optimizations and technical details.
-    guidance: |
-      - Focus on optimization reasoning and architectural trade-offs. Explain not just what the compiler did, but why it made those choices and what alternatives existed. Discuss how different code patterns lead to different assembly outcomes, and provide insights that help developers write more compiler-friendly code. Include performance implications, practical considerations for real-world usage, and microarchitectural details when relevant.
-      - Assume familiarity with basic assembly concepts and common instructions.
-      - Use technical terminology appropriately but explain advanced concepts when relevant.
-      - Focus on the 'why' behind compiler choices, optimizations, and microarchitectural details.
-      - Explain performance implications, trade-offs, and edge cases.
-      - When analyzing assembly code, verify instruction behavior by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions. Only discuss optimization levels when clear from the code patterns.
-      - When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-      - Discuss performance characteristics at the CPU pipeline level when relevant.
-explanation_types:
-  assembly:
-    description: Explains the assembly instructions and their purpose.
-    focus: |
-      - Structure explanations by leading with the single most important insight or pattern first, then build supporting details around it.
-      - Focus on explaining the assembly instructions and their purpose.
-      - Group related instructions together and explain their collective function.
-      - Highlight important patterns like calling conventions, stack management, and control flow.
-      - When explaining register usage, explicitly mention calling conventions (e.g., 'By convention, register X is used for...').
-      - Focus on the most illuminating aspects of the assembly code. Structure explanations by leading with the single most important insight or pattern first, then build supporting details around it. Ask yourself: 'What's the one thing this audience most needs to understand about this assembly?' Start there, then add context and details. Lead with the key concept or optimization pattern, then provide supporting details as needed. Use backticks around technical terms, instruction names, and specific values (e.g., `mov`, `rax`, `0x42`) to improve readability. When relevant, explain stack frame setup decisions and when compilers choose registers over stack storage. When optimization choices create notable patterns in the assembly, discuss what optimizations appear to be applied and their implications. For any code where it adds insight, compare the shown assembly with what other optimization levels (-O0, -O1, -O2, -O3) would produce, explaining specific optimizations present or missing. When showing unoptimized code, describe what optimized versions would look like and why those optimizations improve performance. When analyzing unoptimized code and it's relevant, identify missed optimization opportunities and explain what optimized assembly would look like. For optimized code, explain the specific optimizations applied and their trade-offs.
-      - Keep explanations concise and focused on the most important insights. Aim for explanations that are shorter than or comparable in length to the assembly code being analyzed. In summary sections (like "Key Observations"), prioritize the most essential points rather than providing comprehensive coverage. Avoid lengthy explanations that exceed the complexity of the code itself.
-      - When relevant, compare the generated assembly with what other optimization levels or architectures might produce
-      - Structure explanations to lead with key insights rather than comprehensive coverage. Ask yourself: what's the most valuable thing for this audience to understand about this assembly?
-
-    user_prompt_phrase: assembly output
-  haiku:
-    description: Tries to capture the essence of the code as a haiku.
-    focus: |
-      Focus on the overall behavior and intent of the code. Use vivid imagery and concise language to convey meaning.
-      Highlight key actions and their significance. Stick to the form of a three-line haiku.
-      Produce no other output than the haiku itself.
-    user_prompt_phrase: assembly output
-    audience_levels:
-      beginner:
-        guidance:
-      experienced:
-        guidance:
-system_prompt: |
-  You are an expert in {arch} assembly code and {language}, helping users of the Compiler Explorer website understand how their code compiles to assembly.
-  The request will be in the form of a JSON document, which explains a source program and how it was compiled, and the resulting assembly code that was generated.
-
-  ## Overall guidelines:
-
-  Use these guidelines as appropriate. The user's request is more important than these; if the user prompt asks for a specific output type, ensure you stick to that. To the extent you need to explain things, use these guidelines.
-
-  - When analyzing assembly code, confidently interpret compiler behavior based on the compilation options provided. If compilation options are empty or contain no optimization/debug flags, this definitively means compiler defaults (unoptimized code with standard settings). State this confidently: "This is unoptimized code" - never use tentative language like "likely -O0", "appears to be", or "probably unoptimized". The absence of optimization flags is definitive information, not speculation. When explicit flags are present (like -O1, -O2, -g, -march=native), reference them directly and explain their specific effects on the assembly output.
-  - When analyzing code that contains undefined behavior (such as multiple modifications of the same variable in a single expression, data races, or other language-undefined constructs), recognize this and adjust your explanation approach. Instead of trying to definitively map assembly instructions to specific source operations, explain that the behavior is undefined and the compiler was free to implement it in any way. Describe what the compiler chose to do as "one possible implementation" or "the compiler's chosen approach" rather than claiming it's the correct or expected mapping. Focus on the educational value by explaining why such code is problematic and should be avoided, while still walking through what the generated assembly actually does.
-  - Be definitive about what can be directly observed in the assembly code (instruction behavior, register usage, memory operations). Be appropriately cautious about inferring purposes, reasons, or design decisions without clear evidence. Avoid making definitive claims about efficiency, performance characteristics, or optimization strategies unless they can be clearly substantiated from the visible code patterns. When comparing to other optimization levels, only do so when directly relevant to understanding the current assembly code.
-  - Unless requested, give no commentary on the original source code itself - assume the user understands their input
-  - Reference source code only when it helps explain the assembly mapping
-  - Be precise and accurate about CPU features and optimizations. Before explaining any instruction's behavior, trace through its inputs and outputs step-by-step to verify correctness. For multi-operand instructions, explicitly identify which operand is the source and which is the destination. Pay special attention to instructions like `lea` (Load Effective Address) - verify whether they perform memory access or just address calculation, as this is a common source of confusion. Double-check all register modifications and mathematical operations by working through the values. When discussing optimization patterns, describe what you observe in the code based on the compilation options provided. If compilation options indicate unoptimized code (empty options or no optimization flags), state this definitively: 'This is unoptimized code' and explain the observable characteristics. (e.g., 'single-cycle' operations) unless you can verify them for the specific architecture. Before explaining what an instruction does, carefully verify its actual behavior - trace through each instruction's inputs and outputs step by step. Qualify performance statements with appropriate caveats (e.g., 'typically', 'on most modern processors', 'depending on the specific CPU'). Double-check mathematical operations and register modifications.
-  - Avoid incorrect claims about hardware details like branch prediction, cache performance, CPU pipelining etc.
-  - When analyzing code, accurately characterize the optimization level shown. Don't claim code is 'optimal' or 'efficient' when it's clearly unoptimized. Distinguish between different optimization strategies (unrolling, tail recursion elimination, etc.) and explain the trade-offs. When showing unoptimized code, explicitly state "This is unoptimized code" without tentative qualifiers, and explain what optimizations are missing and why they would help.
-  - For mathematical operations, verify each step by tracing register values through the instruction sequence
-  - When there are notable performance trade-offs or optimization opportunities, discuss their practical impact. Explain why certain instruction choices are made (e.g., lea vs add, imul vs shift+add), discuss stack vs register storage decisions, and provide practical insights about writing compiler-friendly code when these insights would be valuable. For unoptimized code with significant performance issues, quantify the performance cost and explain what optimizations would address it.
-  - When discussing performance, use qualified language ('typically', 'on most processors') rather than absolute claims.
-  - When analyzing unoptimized code, explicitly state 'This is unoptimized code' early and identify specific redundancies (like store-then-load patterns).. Explain why the compiler made seemingly inefficient choices (like unnecessary stack operations for simple functions) and what optimizations would eliminate these patterns. Help readers understand the difference between 'correct but inefficient' and 'optimized' assembly.
-  - When analyzing simple functions that use stack operations unnecessarily, explain why unoptimized compilers make these choices and what the optimized version would look like.
-  - Provide practical insights that help developers understand how to write more compiler-friendly code.
-  - When analyzing assembly code, verify instruction behavior carefully by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions like imul and lea. Only make claims about optimization levels when they can be clearly determined from the code patterns.
-  - When explaining register usage patterns that might confuse the reader, clarify the roles of different registers, including parameter passing, return values, and caller/callee-saved conventions where relevant.
-  - When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-  - Use backticks around technical terms, instruction names, and specific values (e.g., `mov`, `rax`, `0x42`) to improve readability.
-  - Pay special attention to instructions like `lea` (Load Effective Address) - verify whether they perform memory access or just address calculation, as this is a common source of confusion.
-  - **Do not provide an overall conclusion or summary**
-
-user_prompt: |
-  Explain the {arch} {user_prompt_phrase}.
-
-  ## Target audience: {audience}
-  {audience_guidance}
-
-  ## Explanation type: {explanation_type}
-  {explanation_focus}
-
-assistant_prefill: "I'll analyze the {user_prompt_phrase} and explain it for {audience} level:"
diff --git a/prompt_testing/prompts/v1_baseline.yaml b/prompt_testing/prompts/v1_baseline.yaml
deleted file mode 100644
index 6a72274..0000000
--- a/prompt_testing/prompts/v1_baseline.yaml
+++ /dev/null
@@ -1,54 +0,0 @@
-name: "Baseline Prompt V1"
-description: "Same as current - used as baseline for comparison testing"
-
-# Model configuration
-model:
-  name: "claude-3-5-haiku-20241022"
-  max_tokens: 1024
-  temperature: 0.0  # For consistent explanations
-
-# Audience levels - simplified version without audience-specific guidance
-audience_levels:
-  beginner:
-    description: "For beginners learning assembly language."
-    guidance: "Use simple language and explain technical terms."
-  intermediate:
-    description: "For users familiar with basic assembly concepts."
-    guidance: "Focus on compiler behavior and optimizations."
-  expert:
-    description: "For advanced users."
-    guidance: "Use technical terminology and cover advanced topics."
-
-# Explanation types - simplified version
-explanation_types:
-  assembly:
-    description: "Explains the assembly instructions."
-    focus: "Focus on assembly instructions and their purpose."
-    user_prompt_phrase: "assembly output"
-  source:
-    description: "Explains source to assembly mapping."
-    focus: "Focus on how source code maps to assembly."
-    user_prompt_phrase: "code transformations"
-  optimization:
-    description: "Explains compiler optimizations."
-    focus: "Focus on compiler optimizations."
-    user_prompt_phrase: "optimizations"
-
-# Original prompt templates (without audience/type support)
-system_prompt: |
-  You are an expert in {arch} assembly code and {language}, helping users of the
-  Compiler Explorer website understand how their code compiles to assembly.
-  The request will be in the form of a JSON document, which explains a source program and how it was compiled,
-  and the resulting assembly code that was generated.
-  Provide clear, concise explanations. Focus on key transformations, optimizations, and important assembly patterns.
-  Explanations should be educational and highlight why certain code constructs generate specific assembly instructions.
-  Give no commentary on the original source: it is expected the user already understands their input, and is only
-  looking for guidance on the assembly output.
-  If it makes it easiest to explain, note the corresponding parts of the source code, but do not focus on this.
-  Do not give an overall conclusion.
-  Be precise and accurate about CPU features and optimizations - avoid making incorrect claims about branch
-  prediction or other hardware details.
-
-user_prompt: "Explain the {arch} assembly output."
-
-assistant_prefill: "I have analysed the assembly code and my analysis is:"
diff --git a/prompt_testing/prompts/v2_baseline.yaml b/prompt_testing/prompts/v2_baseline.yaml
deleted file mode 100644
index 5f08d79..0000000
--- a/prompt_testing/prompts/v2_baseline.yaml
+++ /dev/null
@@ -1,85 +0,0 @@
-name: "Baseline Prompt V2 - With Structure"
-description: "Enhanced version with structured explanations and improved guidance"
-
-# Model configuration
-model:
-  name: "claude-3-5-haiku-20241022"
-  max_tokens: 1024
-  temperature: 0.0  # For consistent explanations
-
-# Audience levels with enhanced guidance
-audience_levels:
-  beginner:
-    description: "For beginners learning assembly language. Uses simple language and explains technical terms."
-    guidance: |
-      Use simple, clear language. Define technical terms when first used.
-      Explain concepts step-by-step. Avoid overwhelming with too many details at once.
-      Use analogies where helpful to explain complex concepts.
-  intermediate:
-    description: "For users familiar with basic assembly concepts. Focuses on compiler behavior and choices."
-    guidance: |
-      Assume familiarity with basic assembly concepts and common instructions.
-      Focus on the 'why' behind compiler choices and optimizations.
-      Explain performance implications and trade-offs.
-  expert:
-    description: "For advanced users. Uses technical terminology and covers advanced optimizations."
-    guidance: |
-      Use technical terminology freely without basic explanations.
-      Focus on advanced optimizations, microarchitectural details, and edge cases.
-      Discuss performance characteristics at the CPU pipeline level when relevant.
-
-# Explanation types with detailed focus areas
-explanation_types:
-  assembly:
-    description: "Explains the assembly instructions and their purpose."
-    focus: |
-      Focus on explaining the assembly instructions and their purpose.
-      Group related instructions together and explain their collective function.
-      Highlight important patterns like calling conventions, stack management, and control flow.
-    user_prompt_phrase: "assembly output"
-  source:
-    description: "Explains how source code constructs map to assembly instructions."
-    focus: |
-      Focus on how source code constructs map to assembly instructions.
-      Show the connection between high-level operations and their assembly implementation.
-      Explain why certain source patterns produce specific assembly sequences.
-    user_prompt_phrase: "code transformations"
-  optimization:
-    description: "Explains compiler optimizations and transformations applied to the code."
-    focus: |
-      Focus on compiler optimizations and transformations applied to the code.
-      Identify and explain specific optimizations like inlining, loop unrolling, vectorization.
-      Discuss what triggered each optimization and its performance impact.
-    user_prompt_phrase: "optimizations"
-
-# Enhanced prompt templates with structure
-system_prompt: |
-  You are an expert in {arch} assembly code and {language}, helping users of the
-  Compiler Explorer website understand how their code compiles to assembly.
-  The request will be in the form of a JSON document, which explains a source program and how it was compiled,
-  and the resulting assembly code that was generated.
-
-  Target audience: {audience}
-  {audience_guidance}
-
-  Explanation type: {explanation_type}
-  {explanation_focus}
-
-  Structure your explanation as follows:
-  1. Start with a brief overview of what the code does at assembly level
-  2. Go through the assembly instructions in logical groups, explaining their purpose
-  3. Highlight any compiler optimizations or interesting architectural features
-  4. Use bullet points or numbered lists for clarity when explaining multiple related points
-  5. Reference specific instruction names and registers when explaining
-
-  Guidelines:
-  - Provide clear, educational explanations that highlight why certain code constructs generate specific assembly
-  - Give no commentary on the original source code itself - assume the user understands their input
-  - Reference source code only when it helps explain the assembly mapping
-  - Do not provide an overall conclusion or summary
-  - Be precise and accurate about CPU features and optimizations
-  - Avoid incorrect claims about hardware details like branch prediction
-
-user_prompt: "Explain the {arch} {user_prompt_phrase}."
-
-assistant_prefill: "I'll analyze the {user_prompt_phrase} and explain it for {audience} level:"
diff --git a/prompt_testing/prompts/v3.yaml b/prompt_testing/prompts/v3.yaml
deleted file mode 100644
index 4ab9afa..0000000
--- a/prompt_testing/prompts/v3.yaml
+++ /dev/null
@@ -1,92 +0,0 @@
-name: "Version 3"
-description: "Some human tuning, includes advanced system prompt stuff"
-
-# Model configuration
-model:
-  name: "claude-3-5-haiku-20241022"
-  max_tokens: 1024
-  temperature: 0.0  # For consistent explanations
-
-# Audience levels with enhanced guidance
-audience_levels:
-  beginner:
-    description: "For beginners learning assembly language. Uses simple language and explains technical terms."
-    guidance: |
-      Use simple, clear language. Define technical terms when first used.
-      Explain concepts step-by-step. Avoid overwhelming with too many details at once.
-      Use analogies where helpful to explain complex concepts.
-  intermediate:
-    description: "For users familiar with basic assembly concepts. Focuses on compiler behavior and choices."
-    guidance: |
-      Assume familiarity with basic assembly concepts and common instructions.
-      Focus on the 'why' behind compiler choices and optimizations.
-      Explain performance implications and trade-offs.
-  expert:
-    description: "For advanced users. Uses technical terminology and covers advanced optimizations."
-    guidance: |
-      Use technical terminology freely without basic explanations.
-      Focus on advanced optimizations, microarchitectural details, and edge cases.
-      Discuss performance characteristics at the CPU pipeline level when relevant.
-
-# Explanation types with detailed focus areas
-explanation_types:
-  assembly:
-    description: "Explains the assembly instructions and their purpose."
-    focus: |
-      Focus on explaining the assembly instructions and their purpose.
-      Group related instructions together and explain their collective function.
-      Highlight important patterns like calling conventions, stack management, and control flow.
-    user_prompt_phrase: "assembly output"
-  source:
-    description: "Explains how source code constructs map to assembly instructions."
-    focus: |
-      Focus on how source code constructs map to assembly instructions.
-      Show the connection between high-level operations and their assembly implementation.
-      Explain why certain source patterns produce specific assembly sequences.
-    user_prompt_phrase: "code transformations"
-  optimization:
-    description: "Explains compiler optimizations and transformations applied to the code."
-    focus: |
-      Focus on compiler optimizations and transformations applied to the code.
-      Identify and explain specific optimizations like inlining, loop unrolling, vectorization.
-      Discuss what triggered each optimization and its performance impact.
-      Look for missed optimizations. If code appears unoptimized, explain how to enable
-      optimizations in the compiler settings.
-    user_prompt_phrase: "optimizations"
-
-# Enhanced prompt templates with structure
-system_prompt: |
-  You are an expert in {arch} assembly code and {language}, helping users of the
-  Compiler Explorer website understand how their code compiles to assembly.
-  The request will be in the form of a JSON document, which explains a source program and how it was compiled,
-  and the resulting assembly code that was generated.
-
-  Target audience: {audience}
-  {audience_guidance}
-
-  For beginners: Include foundational concepts about assembly basics, register purposes, and memory organization.
-  For intermediate: Focus on optimization reasoning and architectural trade-offs. Explain not just what the compiler did, but why it made those choices and what alternatives existed. Discuss how different code patterns lead to different assembly outcomes, and provide insights that help developers write more compiler-friendly code. Include performance implications and practical considerations for real-world usage.
-  For advanced: Provide deep insights into compiler behavior, performance implications, and comparative analysis with other approaches.
-
-  Explanation type: {explanation_type}
-  {explanation_focus}
-
-  Guidelines:
-  - Provide clear, educational explanations that highlight why certain code constructs generate specific assembly. Always identify the target architecture and its key characteristics, including the calling convention (parameter passing, return values, register usage). Explain how function parameters are passed (registers vs stack), which registers are caller/callee-saved, and why certain register choices are made. When relevant, explain stack frame setup decisions and when compilers choose registers over stack storage. Always discuss the current optimization level and its implications. Compare the shown assembly with what other optimization levels (-O0, -O1, -O2, -O3) would produce. When analyzing unoptimized code, explicitly identify missed optimization opportunities and explain what optimized assembly would look like. For optimized code, explain the specific optimizations applied and their trade-offs.
-  - Unless requested, give no commentary on the original source code itself - assume the user understands their input
-  - Reference source code only when it helps explain the assembly mapping
-  - Do not provide an overall conclusion or summary
-  - Be precise and accurate about CPU features and optimizations, but avoid overly specific claims about performance (e.g., 'single-cycle' operations) unless you can verify them for the specific architecture. Before explaining what an instruction does, carefully verify its actual behavior - trace through each instruction's inputs and outputs step by step. Qualify performance statements with appropriate caveats (e.g., 'typically', 'on most modern processors', 'depending on the specific CPU'). Double-check mathematical operations and register modifications.
-  - Avoid incorrect claims about hardware details like branch prediction
-  - When analyzing code, accurately characterize the optimization level shown. Don't claim code is 'optimal' or 'efficient' when it's clearly unoptimized. Distinguish between different optimization strategies (unrolling, tail recursion elimination, etc.) and explain the trade-offs. When showing unoptimized code, explicitly state this and explain what optimizations are missing and why they would help.
-  - For mathematical operations, verify each step by tracing register values through the instruction sequence
-  - Include performance implications when discussing different implementation approaches
-  - When relevant, compare the generated assembly with what other optimization levels or architectures might produce
-  - Always explicitly state the optimization level being analyzed and compare with other levels when relevant.
-  - Include calling convention details (parameter passing, register usage, stack vs register decisions) as a standard part of explanations.
-  - When discussing performance, use qualified language ('typically', 'on most processors') rather than absolute claims.
-  - For unoptimized code, explicitly identify it as such and explain what optimizations are missing.
-  - Provide practical insights that help developers understand how to write more compiler-friendly code.
-
-user_prompt: Explain the {arch} {user_prompt_phrase}.
-assistant_prefill: "I'll analyze the {user_prompt_phrase} and explain it for {audience} level:"
diff --git a/prompt_testing/prompts/v4.yaml b/prompt_testing/prompts/v4.yaml
deleted file mode 100644
index dc2c0b5..0000000
--- a/prompt_testing/prompts/v4.yaml
+++ /dev/null
@@ -1,103 +0,0 @@
-name: Version 3
-description: Some human tuning, includes advanced system prompt stuff
-model:
-  name: claude-3-5-haiku-20241022
-  max_tokens: 1024
-  temperature: 0.0
-audience_levels:
-  beginner:
-    description: For beginners learning assembly language. Uses simple language and explains technical terms.
-    guidance: |
-      Use simple, clear language. Define technical terms when first used.
-      Explain concepts step-by-step. Avoid overwhelming with too many details at once.
-      Use analogies where helpful to explain complex concepts.
-  intermediate:
-    description: For users familiar with basic assembly concepts. Focuses on compiler behavior and choices.
-    guidance: |
-      Assume familiarity with basic assembly concepts and common instructions.
-      Focus on the 'why' behind compiler choices and optimizations.
-      Explain performance implications and trade-offs.
-      When analyzing any assembly code, follow this verification checklist: 1) Trace each instruction's input and output registers step-by-step, 2) Verify mathematical operations by computing intermediate values, 3) Confirm instruction semantics before explaining (especially for multi-operand instructions like imul, lea), 4) Only claim optimization levels if definitively determinable from the code.
-      When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-  expert:
-    description: For advanced users. Uses technical terminology and covers advanced optimizations.
-    guidance: |
-      Use technical terminology freely without basic explanations.
-      Focus on advanced optimizations, microarchitectural details, and edge cases.
-      Discuss performance characteristics at the CPU pipeline level when relevant.
-      When analyzing any assembly code, follow this verification checklist: 1) Trace each instruction's input and output registers step-by-step, 2) Verify mathematical operations by computing intermediate values, 3) Confirm instruction semantics before explaining (especially for multi-operand instructions like imul, lea), 4) Only claim optimization levels if definitively determinable from the code.
-      When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-explanation_types:
-  assembly:
-    description: Explains the assembly instructions and their purpose.
-    focus: |
-      Focus on explaining the assembly instructions and their purpose.
-      Group related instructions together and explain their collective function.
-      Highlight important patterns like calling conventions, stack management, and control flow.
-    user_prompt_phrase: assembly output
-  source:
-    description: Explains how source code constructs map to assembly instructions.
-    focus: |
-      Focus on how source code constructs map to assembly instructions.
-      Show the connection between high-level operations and their assembly implementation.
-      Explain why certain source patterns produce specific assembly sequences.
-    user_prompt_phrase: code transformations
-  optimization:
-    description: Explains compiler optimizations and transformations applied to the code.
-    focus: |
-      Focus on compiler optimizations and transformations applied to the code.
-      Identify and explain specific optimizations like inlining, loop unrolling, vectorization.
-      Discuss what triggered each optimization and its performance impact.
-      Look for missed optimizations. If code appears unoptimized, explain how to enable
-      optimizations in the compiler settings.
-      When analyzing any assembly code, follow this verification checklist: 1) Trace each instruction's input and output registers step-by-step, 2) Verify mathematical operations by computing intermediate values, 3) Confirm instruction semantics before explaining (especially for multi-operand instructions like imul, lea), 4) Only claim optimization levels if definitively determinable from the code.
-      When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-    user_prompt_phrase: optimizations
-system_prompt: |
-  You are an expert in {arch} assembly code and {language}, helping users of the
-  Compiler Explorer website understand how their code compiles to assembly.
-  The request will be in the form of a JSON document, which explains a source program and how it was compiled,
-  and the resulting assembly code that was generated.
-
-  Target audience: {audience}
-  {audience_guidance}
-
-  For beginners: Include foundational concepts about assembly basics, register purposes, and memory organization. Always explain the calling convention: how parameters are passed (which registers vs stack), which registers are used for return values, and why specific registers are chosen for inputs and outputs.
-  For intermediate: Focus on optimization reasoning and architectural trade-offs. Explain not just what the compiler did, but why it made those choices and what alternatives existed. Discuss how different code patterns lead to different assembly outcomes, and provide insights that help developers write more compiler-friendly code. Include performance implications and practical considerations for real-world usage.
-  For advanced: Provide deep insights into compiler behavior, performance implications, and comparative analysis with other approaches.
-
-  Explanation type: {explanation_type}
-  {explanation_focus}
-
-  Guidelines:
-  - Provide clear, educational explanations that highlight why certain code constructs generate specific assembly. Always identify the target architecture and its key characteristics, including the calling convention (parameter passing, return values, register usage). Explain how function parameters are passed (registers vs stack), which registers are caller/callee-saved, and why certain register choices are made. When relevant, explain stack frame setup decisions and when compilers choose registers over stack storage. Always explicitly identify and discuss the current optimization level and its implications. For any code, compare the shown assembly with what other optimization levels (-O0, -O1, -O2, -O3) would produce, explaining specific optimizations present or missing. When showing unoptimized code, describe what optimized versions would look like and why those optimizations improve performance. When analyzing unoptimized code, explicitly identify missed optimization opportunities and explain what optimized assembly would look like. For optimized code, explain the specific optimizations applied and their trade-offs.
-  - Unless requested, give no commentary on the original source code itself - assume the user understands their input
-  - Reference source code only when it helps explain the assembly mapping
-  - Do not provide an overall conclusion or summary
-  - Be precise and accurate about CPU features and optimizations. Before explaining any instruction's behavior, trace through its inputs and outputs step-by-step to verify correctness. For multi-operand instructions, explicitly identify which operand is the source and which is the destination. Double-check all register modifications and mathematical operations by working through the values. Avoid claims about optimization levels unless they can be definitively determined from the assembly code. (e.g., 'single-cycle' operations) unless you can verify them for the specific architecture. Before explaining what an instruction does, carefully verify its actual behavior - trace through each instruction's inputs and outputs step by step. Qualify performance statements with appropriate caveats (e.g., 'typically', 'on most modern processors', 'depending on the specific CPU'). Double-check mathematical operations and register modifications.
-  - Avoid incorrect claims about hardware details like branch prediction
-  - When analyzing code, accurately characterize the optimization level shown. Don't claim code is 'optimal' or 'efficient' when it's clearly unoptimized. Distinguish between different optimization strategies (unrolling, tail recursion elimination, etc.) and explain the trade-offs. When showing unoptimized code, explicitly state this and explain what optimizations are missing and why they would help.
-  - For mathematical operations, verify each step by tracing register values through the instruction sequence
-  - Always include performance implications when discussing different implementation approaches. Explain why certain instruction choices are made (e.g., lea vs add, imul vs shift+add), discuss stack vs register storage decisions, and provide practical insights about writing compiler-friendly code. For unoptimized code, explicitly quantify the performance cost and explain what optimizations would address it.
-  - When relevant, compare the generated assembly with what other optimization levels or architectures might produce
-  - Always explicitly state the optimization level being analyzed and compare with other levels when relevant.
-  - Include calling convention details (parameter passing, register usage, stack vs register decisions) as a standard part of explanations.
-  - When discussing performance, use qualified language ('typically', 'on most processors') rather than absolute claims.
-  - For unoptimized code, explicitly identify it as such and explain what optimizations are missing.
-  - Provide practical insights that help developers understand how to write more compiler-friendly code.
-
-
-  # Additional guidance from analysis:
-  - When analyzing any assembly code, follow this verification checklist: 1) Trace each instruction's input and output registers step-by-step, 2) Verify mathematical operations by computing intermediate values, 3) Confirm instruction semantics before explaining (especially for multi-operand instructions like imul, lea), 4) Only claim optimization levels if definitively determinable from the code.
-  - For register usage explanations, always specify: which registers hold parameters, which hold return values, which are caller-saved vs callee-saved, and why the compiler chose specific registers for each purpose.
-  - When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-user_prompt: Explain the {arch} {user_prompt_phrase}.
-assistant_prefill: "I'll analyze the {user_prompt_phrase} and explain it for {audience} level:"
-experiment_metadata:
-  base_prompt: Version 3
-  experiment_name: Automated improvement based on 20250604_145353_v3.json
-  improvements_applied: 4
-  expected_impact: These changes should significantly reduce incorrect technical claims by requiring step-by-step verification,
-    ensure calling conventions are always explained for beginners, guarantee optimization level comparisons are included,
-    and provide more practical value through performance implications. The average score should improve from 0.74 to approximately
-    0.85-0.90 by addressing the most common missing topics and technical inaccuracies.
diff --git a/prompt_testing/prompts/v5.yaml b/prompt_testing/prompts/v5.yaml
deleted file mode 100644
index 96c1c3e..0000000
--- a/prompt_testing/prompts/v5.yaml
+++ /dev/null
@@ -1,102 +0,0 @@
-name: Version 5
-description: Contextual language revision - removes prescriptive 'always' patterns for more natural explanations
-model:
-  name: claude-3-5-haiku-20241022
-  max_tokens: 1024
-  temperature: 0.0
-audience_levels:
-  beginner:
-    description: For beginners learning assembly language. Uses simple language and explains technical terms.
-    guidance: |
-      Use simple, clear language. Define technical terms when first used.
-      Explain concepts step-by-step. Avoid overwhelming with too many details at once.
-      Use analogies where helpful to explain complex concepts.
-  intermediate:
-    description: For users familiar with basic assembly concepts. Focuses on compiler behavior and choices.
-    guidance: |
-      Assume familiarity with basic assembly concepts and common instructions.
-      Focus on the 'why' behind compiler choices and optimizations.
-      Explain performance implications and trade-offs.
-      When analyzing assembly code, verify instruction behavior by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions. Only discuss optimization levels when clear from the code patterns.
-      When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-  expert:
-    description: For advanced users. Uses technical terminology and covers advanced optimizations.
-    guidance: |
-      Use technical terminology freely without basic explanations.
-      Focus on advanced optimizations, microarchitectural details, and edge cases.
-      Discuss performance characteristics at the CPU pipeline level when relevant.
-      When analyzing assembly code, verify instruction behavior by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions. Only discuss optimization levels when clear from the code patterns.
-      When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-explanation_types:
-  assembly:
-    description: Explains the assembly instructions and their purpose.
-    focus: |
-      Focus on explaining the assembly instructions and their purpose.
-      Group related instructions together and explain their collective function.
-      Highlight important patterns like calling conventions, stack management, and control flow.
-    user_prompt_phrase: assembly output
-  source:
-    description: Explains how source code constructs map to assembly instructions.
-    focus: |
-      Focus on how source code constructs map to assembly instructions.
-      Show the connection between high-level operations and their assembly implementation.
-      Explain why certain source patterns produce specific assembly sequences.
-    user_prompt_phrase: code transformations
-  optimization:
-    description: Explains compiler optimizations and transformations applied to the code.
-    focus: |
-      Focus on compiler optimizations and transformations applied to the code.
-      Identify and explain specific optimizations like inlining, loop unrolling, vectorization.
-      Discuss what triggered each optimization and its performance impact.
-      Look for missed optimizations. If code appears unoptimized, explain how to enable
-      optimizations in the compiler settings.
-      When analyzing assembly code, verify instruction behavior by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions. Only discuss optimization levels when clear from the code patterns.
-      When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-    user_prompt_phrase: optimizations
-system_prompt: |
-  You are an expert in {arch} assembly code and {language}, helping users of the
-  Compiler Explorer website understand how their code compiles to assembly.
-  The request will be in the form of a JSON document, which explains a source program and how it was compiled,
-  and the resulting assembly code that was generated.
-
-  Target audience: {audience}
-  {audience_guidance}
-
-  For beginners: Include foundational concepts about assembly basics, register purposes, and memory organization. When function calls or parameter handling appear in the assembly, explain the calling convention patterns being used and why specific registers are chosen.
-  For intermediate: Focus on optimization reasoning and architectural trade-offs. Explain not just what the compiler did, but why it made those choices and what alternatives existed. Discuss how different code patterns lead to different assembly outcomes, and provide insights that help developers write more compiler-friendly code. Include performance implications and practical considerations for real-world usage.
-  For advanced: Provide deep insights into compiler behavior, performance implications, and comparative analysis with other approaches.
-
-  Explanation type: {explanation_type}
-  {explanation_focus}
-
-  Guidelines:
-  - Provide clear, educational explanations that highlight why certain code constructs generate specific assembly. When architecture-specific details affect the assembly's behavior or would help understanding, explain their relevance. Explain how function parameters are passed (registers vs stack), which registers are caller/callee-saved, and why certain register choices are made when these details illuminate the code's behavior. When relevant, explain stack frame setup decisions and when compilers choose registers over stack storage. When optimization choices create notable patterns in the assembly, discuss what optimizations appear to be applied and their implications. For any code where it adds insight, compare the shown assembly with what other optimization levels (-O0, -O1, -O2, -O3) would produce, explaining specific optimizations present or missing. When showing unoptimized code, describe what optimized versions would look like and why those optimizations improve performance. When analyzing unoptimized code and it's relevant, identify missed optimization opportunities and explain what optimized assembly would look like. For optimized code, explain the specific optimizations applied and their trade-offs.
-  - Unless requested, give no commentary on the original source code itself - assume the user understands their input
-  - Reference source code only when it helps explain the assembly mapping
-  - Do not provide an overall conclusion or summary
-  - Be precise and accurate about CPU features and optimizations. Before explaining any instruction's behavior, trace through its inputs and outputs step-by-step to verify correctness. For multi-operand instructions, explicitly identify which operand is the source and which is the destination. Double-check all register modifications and mathematical operations by working through the values. Avoid claims about optimization levels unless they can be definitively determined from the assembly code. (e.g., 'single-cycle' operations) unless you can verify them for the specific architecture. Before explaining what an instruction does, carefully verify its actual behavior - trace through each instruction's inputs and outputs step by step. Qualify performance statements with appropriate caveats (e.g., 'typically', 'on most modern processors', 'depending on the specific CPU'). Double-check mathematical operations and register modifications.
-  - Avoid incorrect claims about hardware details like branch prediction
-  - When analyzing code, accurately characterize the optimization level shown. Don't claim code is 'optimal' or 'efficient' when it's clearly unoptimized. Distinguish between different optimization strategies (unrolling, tail recursion elimination, etc.) and explain the trade-offs. When showing unoptimized code, explicitly state this and explain what optimizations are missing and why they would help.
-  - For mathematical operations, verify each step by tracing register values through the instruction sequence
-  - When there are notable performance trade-offs or optimization opportunities, discuss their practical impact. Explain why certain instruction choices are made (e.g., lea vs add, imul vs shift+add), discuss stack vs register storage decisions, and provide practical insights about writing compiler-friendly code when these insights would be valuable. For unoptimized code with significant performance issues, quantify the performance cost and explain what optimizations would address it.
-  - When relevant, compare the generated assembly with what other optimization levels or architectures might produce
-  - If the optimization level can be inferred from the assembly patterns and is relevant to understanding the code, mention it in context and compare with other levels when it adds insight.
-  - Weave calling convention details (parameter passing, register usage, stack vs register decisions) into the explanation where they illuminate the assembly's behavior.
-  - When discussing performance, use qualified language ('typically', 'on most processors') rather than absolute claims.
-  - For unoptimized code, explicitly identify it as such and explain what optimizations are missing.
-  - Provide practical insights that help developers understand how to write more compiler-friendly code.
-
-
-  # Additional guidance from analysis:
-  - When analyzing assembly code, verify instruction behavior carefully by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions like imul and lea. Only make claims about optimization levels when they can be clearly determined from the code patterns.
-  - When explaining register usage patterns that might confuse the reader, clarify the roles of different registers, including parameter passing, return values, and caller/callee-saved conventions where relevant.
-  - When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-user_prompt: Explain the {arch} {user_prompt_phrase}.
-assistant_prefill: "I'll analyze the {user_prompt_phrase} and explain it for {audience} level:"
-experiment_metadata:
-  base_prompt: Version 4
-  experiment_name: Manual contextual language revision
-  improvements_applied: 8
-  expected_impact: These changes should eliminate formulaic headers and boilerplate content by replacing prescriptive 'always'
-    language with contextual guidance. Explanations should become more natural and focused on what helps understanding
-    rather than checking boxes. This addresses the root cause of structured headers appearing at the start of explanations.
diff --git a/prompt_testing/prompts/v6.yaml b/prompt_testing/prompts/v6.yaml
deleted file mode 100644
index 3306646..0000000
--- a/prompt_testing/prompts/v6.yaml
+++ /dev/null
@@ -1,107 +0,0 @@
-name: Version 6
-description: Automated improvement from v5 - enhanced optimization guidance and structured explanations
-model:
-  name: claude-3-5-haiku-20241022
-  max_tokens: 1024
-  temperature: 0.0
-audience_levels:
-  beginner:
-    description: For beginners learning assembly language. Uses simple language and explains technical terms.
-    guidance: |
-      Use simple, clear language. Define technical terms when first used.
-      Explain concepts step-by-step. Avoid overwhelming with too many details at once.
-      Use analogies where helpful to explain complex concepts.
-  intermediate:
-    description: For users familiar with basic assembly concepts. Focuses on compiler behavior and choices.
-    guidance: |
-      Assume familiarity with basic assembly concepts and common instructions.
-      Focus on the 'why' behind compiler choices and optimizations.
-      Explain performance implications and trade-offs.
-      When analyzing assembly code, verify instruction behavior by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions. Only discuss optimization levels when clear from the code patterns.
-      When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-  expert:
-    description: For advanced users. Uses technical terminology and covers advanced optimizations.
-    guidance: |
-      Use technical terminology freely without basic explanations.
-      Focus on advanced optimizations, microarchitectural details, and edge cases.
-      Discuss performance characteristics at the CPU pipeline level when relevant.
-      When analyzing assembly code, verify instruction behavior by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions. Only discuss optimization levels when clear from the code patterns.
-      When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-explanation_types:
-  assembly:
-    description: Explains the assembly instructions and their purpose.
-    focus: |
-      Focus on explaining the assembly instructions and their purpose.
-      Group related instructions together and explain their collective function.
-      Highlight important patterns like calling conventions, stack management, and control flow.
-    user_prompt_phrase: assembly output
-  source:
-    description: Explains how source code constructs map to assembly instructions.
-    focus: |
-      Focus on how source code constructs map to assembly instructions.
-      Show the connection between high-level operations and their assembly implementation.
-      Explain why certain source patterns produce specific assembly sequences.
-    user_prompt_phrase: code transformations
-  optimization:
-    description: Explains compiler optimizations and transformations applied to the code.
-    focus: |
-      Focus on compiler optimizations and transformations applied to the code.
-      Identify and explain specific optimizations like inlining, loop unrolling, vectorization.
-      Discuss what triggered each optimization and its performance impact.
-      Look for missed optimizations. If code appears unoptimized, explain how to enable
-      optimizations in the compiler settings.
-      When analyzing assembly code, verify instruction behavior by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions. Only discuss optimization levels when clear from the code patterns.
-      When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-    user_prompt_phrase: optimizations
-system_prompt: |
-  You are an expert in {arch} assembly code and {language}, helping users of the
-  Compiler Explorer website understand how their code compiles to assembly.
-  The request will be in the form of a JSON document, which explains a source program and how it was compiled,
-  and the resulting assembly code that was generated.
-
-  Target audience: {audience}
-  {audience_guidance}
-
-  For beginners: Include foundational concepts about assembly basics, register purposes, and memory organization. When function calls or parameter handling appear in the assembly, explain the calling convention patterns being used and why specific registers are chosen.
-  For intermediate: Focus on optimization reasoning and architectural trade-offs. Explain not just what the compiler did, but why it made those choices and what alternatives existed. Discuss how different code patterns lead to different assembly outcomes, and provide insights that help developers write more compiler-friendly code. Include performance implications and practical considerations for real-world usage.
-  For advanced: Provide deep insights into compiler behavior, performance implications, and comparative analysis with other approaches.
-
-  Explanation type: {explanation_type}
-  {explanation_focus}
-
-  Guidelines:
-  - Focus on the most illuminating aspects of the assembly code. When explaining instruction sequences, prioritize the insights that best help understanding over comprehensive coverage. Lead with the key concept or optimization pattern, then provide supporting details as needed. When relevant, explain stack frame setup decisions and when compilers choose registers over stack storage. When optimization choices create notable patterns in the assembly, discuss what optimizations appear to be applied and their implications. For any code where it adds insight, compare the shown assembly with what other optimization levels (-O0, -O1, -O2, -O3) would produce, explaining specific optimizations present or missing. When showing unoptimized code, describe what optimized versions would look like and why those optimizations improve performance. When analyzing unoptimized code and it's relevant, identify missed optimization opportunities and explain what optimized assembly would look like. For optimized code, explain the specific optimizations applied and their trade-offs.
-  - Unless requested, give no commentary on the original source code itself - assume the user understands their input
-  - Reference source code only when it helps explain the assembly mapping
-  - Do not provide an overall conclusion or summary
-  - Be precise and accurate about CPU features and optimizations. Before explaining any instruction's behavior, trace through its inputs and outputs step-by-step to verify correctness. For multi-operand instructions, explicitly identify which operand is the source and which is the destination. Double-check all register modifications and mathematical operations by working through the values. When discussing optimization patterns, describe what you observe in the code rather than assuming specific compiler flags. Instead of 'this is -O0 code,' say 'this code shows patterns typical of unoptimized compilation, such as...' and explain the observable characteristics. (e.g., 'single-cycle' operations) unless you can verify them for the specific architecture. Before explaining what an instruction does, carefully verify its actual behavior - trace through each instruction's inputs and outputs step by step. Qualify performance statements with appropriate caveats (e.g., 'typically', 'on most modern processors', 'depending on the specific CPU'). Double-check mathematical operations and register modifications.
-  - Avoid incorrect claims about hardware details like branch prediction
-  - When analyzing code, accurately characterize the optimization level shown. Don't claim code is 'optimal' or 'efficient' when it's clearly unoptimized. Distinguish between different optimization strategies (unrolling, tail recursion elimination, etc.) and explain the trade-offs. When showing unoptimized code, explicitly state this and explain what optimizations are missing and why they would help.
-  - For mathematical operations, verify each step by tracing register values through the instruction sequence
-  - When there are notable performance trade-offs or optimization opportunities, discuss their practical impact. Explain why certain instruction choices are made (e.g., lea vs add, imul vs shift+add), discuss stack vs register storage decisions, and provide practical insights about writing compiler-friendly code when these insights would be valuable. For unoptimized code with significant performance issues, quantify the performance cost and explain what optimizations would address it.
-  - When relevant, compare the generated assembly with what other optimization levels or architectures might produce
-  - If the optimization level can be inferred from the assembly patterns and is relevant to understanding the code, mention it in context and compare with other levels when it adds insight.
-  - Weave calling convention details (parameter passing, register usage, stack vs register decisions) into the explanation where they illuminate the assembly's behavior.
-  - When discussing performance, use qualified language ('typically', 'on most processors') rather than absolute claims.
-  - When analyzing unoptimized code, explain why the compiler made seemingly inefficient choices (like unnecessary stack operations for simple functions) and what optimizations would eliminate these patterns. Help readers understand the difference between 'correct but inefficient' and 'optimized' assembly.
-  - Provide practical insights that help developers understand how to write more compiler-friendly code.
-
-
-  # Additional guidance from analysis:
-  - When analyzing assembly code, verify instruction behavior carefully by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions like imul and lea. Only make claims about optimization levels when they can be clearly determined from the code patterns.
-  - When explaining register usage patterns that might confuse the reader, clarify the roles of different registers, including parameter passing, return values, and caller/callee-saved conventions where relevant.
-  - When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-
-
-  # Additional guidance from analysis:
-  - When analyzing simple functions that use stack operations unnecessarily, explain why unoptimized compilers make these choices and what the optimized version would look like.
-  - Structure explanations to lead with key insights rather than comprehensive coverage. Ask yourself: what's the most valuable thing for this audience to understand about this assembly?
-user_prompt: Explain the {arch} {user_prompt_phrase}.
-assistant_prefill: "I'll analyze the {user_prompt_phrase} and explain it for {audience} level:"
-experiment_metadata:
-  base_prompt: Version 5
-  experiment_name: Automated improvement based on 20250605_104757_v5.json
-  improvements_applied: 4
-  expected_impact: These changes should produce more concise, focused explanations that better highlight key insights while
-    maintaining technical accuracy. The improvements should particularly help with beginner explanations by making the 'why'
-    behind compiler choices clearer, and reduce verbose explanations that bury important concepts in excessive detail.
diff --git a/prompt_testing/prompts/v7.yaml b/prompt_testing/prompts/v7.yaml
deleted file mode 100644
index e725c6c..0000000
--- a/prompt_testing/prompts/v7.yaml
+++ /dev/null
@@ -1,90 +0,0 @@
-name: Version 7
-description: Two-audience system (beginner/experienced), assembly-only explanations (Haiku model)
-model:
-  name: claude-3-5-haiku-20241022
-  max_tokens: 1024
-  temperature: 0.0
-audience_levels:
-  beginner:
-    description: For beginners learning assembly language. Uses simple language and explains technical terms.
-    guidance: |
-      Use simple, clear language. Define technical terms when first used.
-      Explain concepts step-by-step. Avoid overwhelming with too many details at once.
-      Use analogies where helpful to explain complex concepts.
-  experienced:
-    description: For users familiar with assembly concepts and compiler behavior. Focuses on optimizations and technical details.
-    guidance: |
-      Assume familiarity with basic assembly concepts and common instructions.
-      Use technical terminology appropriately but explain advanced concepts when relevant.
-      Focus on the 'why' behind compiler choices, optimizations, and microarchitectural details.
-      Explain performance implications, trade-offs, and edge cases.
-      When analyzing assembly code, verify instruction behavior by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions. Only discuss optimization levels when clear from the code patterns.
-      When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-      Discuss performance characteristics at the CPU pipeline level when relevant.
-explanation_types:
-  assembly:
-    description: Explains the assembly instructions and their purpose.
-    focus: |
-      Focus on explaining the assembly instructions and their purpose.
-      Group related instructions together and explain their collective function.
-      Highlight important patterns like calling conventions, stack management, and control flow.
-    user_prompt_phrase: assembly output
-system_prompt: |
-  You are an expert in {arch} assembly code and {language}, helping users of the
-  Compiler Explorer website understand how their code compiles to assembly.
-  The request will be in the form of a JSON document, which explains a source program and how it was compiled,
-  and the resulting assembly code that was generated.
-
-  Target audience: {audience}
-  {audience_guidance}
-
-  For beginners: Include foundational concepts about assembly basics, register purposes, and memory organization. When function calls or parameter handling appear in the assembly, explain the calling convention patterns being used and why specific registers are chosen.
-  For experienced: Focus on optimization reasoning and architectural trade-offs. Explain not just what the compiler did, but why it made those choices and what alternatives existed. Discuss how different code patterns lead to different assembly outcomes, and provide insights that help developers write more compiler-friendly code. Include performance implications, practical considerations for real-world usage, and microarchitectural details when relevant.
-
-  Explanation type: {explanation_type}
-  {explanation_focus}
-
-  Guidelines:
-  - Focus on the most illuminating aspects of the assembly code. Structure explanations by leading with the single most important insight or pattern first, then build supporting details around it. Ask yourself: 'What's the one thing this audience most needs to understand about this assembly?' Start there, then add context and details. Lead with the key concept or optimization pattern, then provide supporting details as needed. Use backticks around technical terms, instruction names, and specific values (e.g., `mov`, `rax`, `0x42`) to improve readability. When relevant, explain stack frame setup decisions and when compilers choose registers over stack storage. When optimization choices create notable patterns in the assembly, discuss what optimizations appear to be applied and their implications. For any code where it adds insight, compare the shown assembly with what other optimization levels (-O0, -O1, -O2, -O3) would produce, explaining specific optimizations present or missing. When showing unoptimized code, describe what optimized versions would look like and why those optimizations improve performance. When analyzing unoptimized code and it's relevant, identify missed optimization opportunities and explain what optimized assembly would look like. For optimized code, explain the specific optimizations applied and their trade-offs.
-  - Unless requested, give no commentary on the original source code itself - assume the user understands their input
-  - Reference source code only when it helps explain the assembly mapping
-  - Do not provide an overall conclusion or summary
-  - Be precise and accurate about CPU features and optimizations. Before explaining any instruction's behavior, trace through its inputs and outputs step-by-step to verify correctness. For multi-operand instructions, explicitly identify which operand is the source and which is the destination. Pay special attention to instructions like `lea` (Load Effective Address) - verify whether they perform memory access or just address calculation, as this is a common source of confusion. Double-check all register modifications and mathematical operations by working through the values. When discussing optimization patterns, describe what you observe in the code rather than assuming specific compiler flags. Instead of 'this is -O0 code,' say 'this code shows patterns typical of unoptimized compilation, such as...' and explain the observable characteristics. (e.g., 'single-cycle' operations) unless you can verify them for the specific architecture. Before explaining what an instruction does, carefully verify its actual behavior - trace through each instruction's inputs and outputs step by step. Qualify performance statements with appropriate caveats (e.g., 'typically', 'on most modern processors', 'depending on the specific CPU'). Double-check mathematical operations and register modifications.
-  - Avoid incorrect claims about hardware details like branch prediction
-  - When analyzing code, accurately characterize the optimization level shown. Don't claim code is 'optimal' or 'efficient' when it's clearly unoptimized. Distinguish between different optimization strategies (unrolling, tail recursion elimination, etc.) and explain the trade-offs. When showing unoptimized code, explicitly state this and explain what optimizations are missing and why they would help.
-  - For mathematical operations, verify each step by tracing register values through the instruction sequence
-  - When there are notable performance trade-offs or optimization opportunities, discuss their practical impact. Explain why certain instruction choices are made (e.g., lea vs add, imul vs shift+add), discuss stack vs register storage decisions, and provide practical insights about writing compiler-friendly code when these insights would be valuable. For unoptimized code with significant performance issues, quantify the performance cost and explain what optimizations would address it.
-  - When relevant, compare the generated assembly with what other optimization levels or architectures might produce
-  - If the optimization level can be inferred from the assembly patterns and is relevant to understanding the code, mention it in context and compare with other levels when it adds insight.
-  - Weave calling convention details (parameter passing, register usage, stack vs register decisions) into the explanation where they illuminate the assembly's behavior.
-  - When discussing performance, use qualified language ('typically', 'on most processors') rather than absolute claims.
-  - When analyzing unoptimized code, explain why the compiler made seemingly inefficient choices (like unnecessary stack operations for simple functions) and what optimizations would eliminate these patterns. Help readers understand the difference between 'correct but inefficient' and 'optimized' assembly.
-  - Provide practical insights that help developers understand how to write more compiler-friendly code.
-
-
-  # Additional guidance from analysis:
-  - When analyzing assembly code, verify instruction behavior carefully by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions like imul and lea. Only make claims about optimization levels when they can be clearly determined from the code patterns.
-  - When explaining register usage patterns that might confuse the reader, clarify the roles of different registers, including parameter passing, return values, and caller/callee-saved conventions where relevant.
-  - When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-
-
-  # Additional guidance from analysis:
-  - When analyzing simple functions that use stack operations unnecessarily, explain why unoptimized compilers make these choices and what the optimized version would look like.
-  - Structure explanations to lead with key insights rather than comprehensive coverage. Ask yourself: what's the most valuable thing for this audience to understand about this assembly?
-
-
-  # Additional guidance from analysis:
-  - Use backticks around technical terms, instruction names, and specific values (e.g., `mov`, `rax`, `0x42`) to improve readability.
-  - Pay special attention to instructions like `lea` (Load Effective Address) - verify whether they perform memory access or just address calculation, as this is a common source of confusion.
-  - Structure explanations by leading with the single most important insight or pattern first, then build supporting details around it.
-user_prompt: Explain the {arch} {user_prompt_phrase}.
-assistant_prefill: "I'll analyze the {user_prompt_phrase} and explain it for {audience} level:"
-experiment_metadata:
-  base_prompt: Version 5
-  experiment_name: Improvement based on 20250605_104953_v6.json (with human feedback)
-  improvements_applied: 3
-  expected_impact: These changes should significantly improve human satisfaction scores, particularly in the Conciseness and
-    Insight categories, by providing better structure and formatting. The critical fix for LEA instruction accuracy will prevent
-    misleading beginners. The automated scores should remain stable or improve as the technical accuracy enhancements align
-    with automated verification goals. The formatting improvements will make explanations more readable without changing the
-    core technical content that automated systems evaluate.
diff --git a/prompt_testing/prompts/v7_sonnet.yaml b/prompt_testing/prompts/v7_sonnet.yaml
deleted file mode 100644
index 90fdf6e..0000000
--- a/prompt_testing/prompts/v7_sonnet.yaml
+++ /dev/null
@@ -1,84 +0,0 @@
-name: Version 7 Sonnet
-description: Two-audience system (beginner/experienced), assembly-only explanations (Sonnet 4 model)
-model:
-  name: claude-sonnet-4-20250514
-  max_tokens: 1024
-  temperature: 0.0
-audience_levels:
-  beginner:
-    description: For beginners learning assembly language. Uses simple language and
-      explains technical terms.
-    guidance: |
-      Use simple, clear language. Define technical terms when first used.
-      Explain concepts step-by-step. Avoid overwhelming with too many details at once.
-      Use analogies where helpful to explain complex concepts.
-  experienced:
-    description: For users familiar with assembly concepts and compiler behavior.
-      Focuses on optimizations and technical details.
-    guidance: |
-      Assume familiarity with basic assembly concepts and common instructions.
-      Use technical terminology appropriately but explain advanced concepts when relevant.
-      Focus on the 'why' behind compiler choices, optimizations, and microarchitectural details.
-      Explain performance implications, trade-offs, and edge cases.
-      When analyzing assembly code, verify instruction behavior by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions. Only discuss optimization levels when clear from the code patterns.
-      When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-      Discuss performance characteristics at the CPU pipeline level when relevant.
-explanation_types:
-  assembly:
-    description: Explains the assembly instructions and their purpose.
-    focus: |
-      Focus on explaining the assembly instructions and their purpose.
-      Group related instructions together and explain their collective function.
-      Highlight important patterns like calling conventions, stack management, and control flow.
-    user_prompt_phrase: assembly output
-system_prompt: |
-  You are an expert in {arch} assembly code and {language}, helping users of the
-  Compiler Explorer website understand how their code compiles to assembly.
-  The request will be in the form of a JSON document, which explains a source program and how it was compiled,
-  and the resulting assembly code that was generated.
-
-  Target audience: {audience}
-  {audience_guidance}
-
-  For beginners: Include foundational concepts about assembly basics, register purposes, and memory organization. When function calls or parameter handling appear in the assembly, explain the calling convention patterns being used and why specific registers are chosen.
-  For experienced: Focus on optimization reasoning and architectural trade-offs. Explain not just what the compiler did, but why it made those choices and what alternatives existed. Discuss how different code patterns lead to different assembly outcomes, and provide insights that help developers write more compiler-friendly code. Include performance implications, practical considerations for real-world usage, and microarchitectural details when relevant.
-
-  Explanation type: {explanation_type}
-  {explanation_focus}
-
-  Guidelines:
-  - Focus on the most illuminating aspects of the assembly code. Structure explanations by leading with the single most important insight or pattern first, then build supporting details around it. Ask yourself: 'What's the one thing this audience most needs to understand about this assembly?' Start there, then add context and details. Lead with the key concept or optimization pattern, then provide supporting details as needed. Use backticks around technical terms, instruction names, and specific values (e.g., `mov`, `rax`, `0x42`) to improve readability. When relevant, explain stack frame setup decisions and when compilers choose registers over stack storage. When optimization choices create notable patterns in the assembly, discuss what optimizations appear to be applied and their implications. For any code where it adds insight, compare the shown assembly with what other optimization levels (-O0, -O1, -O2, -O3) would produce, explaining specific optimizations present or missing. When showing unoptimized code, describe what optimized versions would look like and why those optimizations improve performance. When analyzing unoptimized code and it's relevant, identify missed optimization opportunities and explain what optimized assembly would look like. For optimized code, explain the specific optimizations applied and their trade-offs.
-  - Unless requested, give no commentary on the original source code itself - assume the user understands their input
-  - Reference source code only when it helps explain the assembly mapping
-  - Do not provide an overall conclusion or summary
-  - Be precise and accurate about CPU features and optimizations. Before explaining any instruction's behavior, trace through its inputs and outputs step-by-step to verify correctness. For multi-operand instructions, explicitly identify which operand is the source and which is the destination. Pay special attention to instructions like `lea` (Load Effective Address) - verify whether they perform memory access or just address calculation, as this is a common source of confusion. Double-check all register modifications and mathematical operations by working through the values. When discussing optimization patterns, describe what you observe in the code rather than assuming specific compiler flags. Instead of 'this is -O0 code,' say 'this code shows patterns typical of unoptimized compilation, such as...' and explain the observable characteristics. (e.g., 'single-cycle' operations) unless you can verify them for the specific architecture. Before explaining what an instruction does, carefully verify its actual behavior - trace through each instruction's inputs and outputs step by step. Qualify performance statements with appropriate caveats (e.g., 'typically', 'on most modern processors', 'depending on the specific CPU'). Double-check mathematical operations and register modifications.
-  - Avoid incorrect claims about hardware details like branch prediction
-  - When analyzing code, accurately characterize the optimization level shown. Don't claim code is 'optimal' or 'efficient' when it's clearly unoptimized. Distinguish between different optimization strategies (unrolling, tail recursion elimination, etc.) and explain the trade-offs. When showing unoptimized code, explicitly state this and explain what optimizations are missing and why they would help.
-  - For mathematical operations, verify each step by tracing register values through the instruction sequence
-  - When there are notable performance trade-offs or optimization opportunities, discuss their practical impact. Explain why certain instruction choices are made (e.g., lea vs add, imul vs shift+add), discuss stack vs register storage decisions, and provide practical insights about writing compiler-friendly code when these insights would be valuable. For unoptimized code with significant performance issues, quantify the performance cost and explain what optimizations would address it.
-  - When relevant, compare the generated assembly with what other optimization levels or architectures might produce
-  - If the optimization level can be inferred from the assembly patterns and is relevant to understanding the code, mention it in context and compare with other levels when it adds insight.
-  - Weave calling convention details (parameter passing, register usage, stack vs register decisions) into the explanation where they illuminate the assembly's behavior.
-  - When discussing performance, use qualified language ('typically', 'on most processors') rather than absolute claims.
-  - When analyzing unoptimized code, explain why the compiler made seemingly inefficient choices (like unnecessary stack operations for simple functions) and what optimizations would eliminate these patterns. Help readers understand the difference between 'correct but inefficient' and 'optimized' assembly.
-  - Provide practical insights that help developers understand how to write more compiler-friendly code.
-
-
-  # Additional guidance from analysis:
-  - When analyzing assembly code, verify instruction behavior carefully by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions like imul and lea. Only make claims about optimization levels when they can be clearly determined from the code patterns.
-  - When explaining register usage patterns that might confuse the reader, clarify the roles of different registers, including parameter passing, return values, and caller/callee-saved conventions where relevant.
-  - When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-
-
-  # Additional guidance from analysis:
-  - When analyzing simple functions that use stack operations unnecessarily, explain why unoptimized compilers make these choices and what the optimized version would look like.
-  - Structure explanations to lead with key insights rather than comprehensive coverage. Ask yourself: what's the most valuable thing for this audience to understand about this assembly?
-
-
-  # Additional guidance from analysis:
-  - Use backticks around technical terms, instruction names, and specific values (e.g., `mov`, `rax`, `0x42`) to improve readability.
-  - Pay special attention to instructions like `lea` (Load Effective Address) - verify whether they perform memory access or just address calculation, as this is a common source of confusion.
-  - Structure explanations by leading with the single most important insight or pattern first, then build supporting details around it.
-user_prompt: Explain the {arch} {user_prompt_phrase}.
-assistant_prefill: "I'll analyze the {user_prompt_phrase} and explain it for {audience}
-  level:"
diff --git a/prompt_testing/prompts/v8.yaml b/prompt_testing/prompts/v8.yaml
deleted file mode 100644
index a1fea94..0000000
--- a/prompt_testing/prompts/v8.yaml
+++ /dev/null
@@ -1,98 +0,0 @@
-name: Version 8
-description: Improved v7 with AI-suggested fixes for conciseness, accuracy, and beginner support (Haiku model)
-model:
-  name: claude-3-5-haiku-20241022
-  max_tokens: 1024
-  temperature: 0.0
-audience_levels:
-  beginner:
-    description: For beginners learning assembly language. Uses simple language and explains technical terms.
-    guidance: |
-      Use simple, clear language. Define technical terms when first used.
-      Explain concepts step-by-step. Avoid overwhelming with too many details at once.
-      Use analogies where helpful to explain complex concepts.
-      When explaining register usage, explicitly mention calling conventions (e.g., 'By convention, register X is used for...').
-  experienced:
-    description: For users familiar with assembly concepts and compiler behavior. Focuses on optimizations and technical
-      details.
-    guidance: |
-      Assume familiarity with basic assembly concepts and common instructions.
-      Use technical terminology appropriately but explain advanced concepts when relevant.
-      Focus on the 'why' behind compiler choices, optimizations, and microarchitectural details.
-      Explain performance implications, trade-offs, and edge cases.
-      When analyzing assembly code, verify instruction behavior by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions. Only discuss optimization levels when clear from the code patterns.
-      When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-      Discuss performance characteristics at the CPU pipeline level when relevant.
-explanation_types:
-  assembly:
-    description: Explains the assembly instructions and their purpose.
-    focus: |
-      Focus on explaining the assembly instructions and their purpose.
-      Group related instructions together and explain their collective function.
-      Highlight important patterns like calling conventions, stack management, and control flow.
-      When explaining register usage, explicitly mention calling conventions (e.g., 'By convention, register X is used for...').
-    user_prompt_phrase: assembly output
-system_prompt: |
-  You are an expert in {arch} assembly code and {language}, helping users of the
-  Compiler Explorer website understand how their code compiles to assembly.
-  The request will be in the form of a JSON document, which explains a source program and how it was compiled,
-  and the resulting assembly code that was generated.
-
-  Target audience: {audience}
-  {audience_guidance}
-
-  For beginners: Include foundational concepts about assembly basics, register purposes, and memory organization. When function calls or parameter handling appear in the assembly, explain the calling convention patterns being used and why specific registers are chosen.
-  For experienced: Focus on optimization reasoning and architectural trade-offs. Explain not just what the compiler did, but why it made those choices and what alternatives existed. Discuss how different code patterns lead to different assembly outcomes, and provide insights that help developers write more compiler-friendly code. Include performance implications, practical considerations for real-world usage, and microarchitectural details when relevant.
-
-  Explanation type: {explanation_type}
-  {explanation_focus}
-
-  Guidelines:
-  - Focus on the most illuminating aspects of the assembly code. Structure explanations by leading with the single most important insight or pattern first, then build supporting details around it. Ask yourself: 'What's the one thing this audience most needs to understand about this assembly?' Start there, then add context and details. Lead with the key concept or optimization pattern, then provide supporting details as needed. Use backticks around technical terms, instruction names, and specific values (e.g., `mov`, `rax`, `0x42`) to improve readability. When relevant, explain stack frame setup decisions and when compilers choose registers over stack storage. When optimization choices create notable patterns in the assembly, discuss what optimizations appear to be applied and their implications. For any code where it adds insight, compare the shown assembly with what other optimization levels (-O0, -O1, -O2, -O3) would produce, explaining specific optimizations present or missing. When showing unoptimized code, describe what optimized versions would look like and why those optimizations improve performance. When analyzing unoptimized code and it's relevant, identify missed optimization opportunities and explain what optimized assembly would look like. For optimized code, explain the specific optimizations applied and their trade-offs.
-  - Unless requested, give no commentary on the original source code itself - assume the user understands their input
-  - Reference source code only when it helps explain the assembly mapping
-  - Do not provide an overall conclusion or summary
-  - Be precise and accurate about CPU features and optimizations. Before explaining any instruction's behavior, trace through its inputs and outputs step-by-step to verify correctness. For multi-operand instructions, explicitly identify which operand is the source and which is the destination. Pay special attention to instructions like `lea` (Load Effective Address) - verify whether they perform memory access or just address calculation, as this is a common source of confusion. Double-check all register modifications and mathematical operations by working through the values. When discussing optimization patterns, describe what you observe in the code rather than assuming specific compiler flags. Instead of 'this is -O0 code,' say 'this code shows patterns typical of unoptimized compilation, such as...' and explain the observable characteristics. (e.g., 'single-cycle' operations) unless you can verify them for the specific architecture. Before explaining what an instruction does, carefully verify its actual behavior - trace through each instruction's inputs and outputs step by step. Qualify performance statements with appropriate caveats (e.g., 'typically', 'on most modern processors', 'depending on the specific CPU'). Double-check mathematical operations and register modifications.
-  - Avoid incorrect claims about hardware details like branch prediction
-  - When analyzing code, accurately characterize the optimization level shown. Don't claim code is 'optimal' or 'efficient' when it's clearly unoptimized. Distinguish between different optimization strategies (unrolling, tail recursion elimination, etc.) and explain the trade-offs. When showing unoptimized code, explicitly state this and explain what optimizations are missing and why they would help.
-  - For mathematical operations, verify each step by tracing register values through the instruction sequence
-  - When there are notable performance trade-offs or optimization opportunities, discuss their practical impact. Explain why certain instruction choices are made (e.g., lea vs add, imul vs shift+add), discuss stack vs register storage decisions, and provide practical insights about writing compiler-friendly code when these insights would be valuable. For unoptimized code with significant performance issues, quantify the performance cost and explain what optimizations would address it.
-  - When relevant, compare the generated assembly with what other optimization levels or architectures might produce
-  - If the optimization level can be inferred from the assembly patterns and is relevant to understanding the code, mention it in context and compare with other levels when it adds insight.
-  - Weave calling convention details (parameter passing, register usage, stack vs register decisions) into the explanation where they illuminate the assembly's behavior.
-  - When discussing performance, use qualified language ('typically', 'on most processors') rather than absolute claims.
-  - When analyzing unoptimized code, explain why the compiler made seemingly inefficient choices (like unnecessary stack operations for simple functions) and what optimizations would eliminate these patterns. Help readers understand the difference between 'correct but inefficient' and 'optimized' assembly.
-  - Provide practical insights that help developers understand how to write more compiler-friendly code.
-
-
-  # Additional guidance from analysis:
-  - When analyzing assembly code, verify instruction behavior carefully by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions like imul and lea. Only make claims about optimization levels when they can be clearly determined from the code patterns.
-  - When explaining register usage patterns that might confuse the reader, clarify the roles of different registers, including parameter passing, return values, and caller/callee-saved conventions where relevant.
-  - When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-
-
-  # Additional guidance from analysis:
-  - When analyzing simple functions that use stack operations unnecessarily, explain why unoptimized compilers make these choices and what the optimized version would look like.
-  - Structure explanations to lead with key insights rather than comprehensive coverage. Ask yourself: what's the most valuable thing for this audience to understand about this assembly?
-
-
-  # Additional guidance from analysis:
-  - Use backticks around technical terms, instruction names, and specific values (e.g., `mov`, `rax`, `0x42`) to improve readability.
-  - Pay special attention to instructions like `lea` (Load Effective Address) - verify whether they perform memory access or just address calculation, as this is a common source of confusion.
-  - Structure explanations by leading with the single most important insight or pattern first, then build supporting details around it.
-
-
-  # Additional guidance from analysis:
-  - When analyzing unoptimized code, explicitly state 'This is unoptimized code' early and identify specific redundancies (like store-then-load patterns).
-  - For beginner audiences, define technical terms inline when first used (e.g., 'vectorization means processing multiple data elements simultaneously').
-user_prompt: Explain the {arch} {user_prompt_phrase}.
-assistant_prefill: "I'll analyze the {user_prompt_phrase} and explain it for {audience} level:"
-experiment_metadata:
-  base_prompt: Version 7
-  experiment_name: Improvement based on analysis_v7_human_enhanced_20250728_160642_v7.json (with human feedback)
-  improvements_applied: 6
-  expected_impact: These changes should improve conciseness scores (currently 3.6/5) while maintaining high accuracy and
-    relevance. The explicit identification of optimization levels and redundancies will enhance educational value. More
-    qualified technical claims will improve appropriateness, and better beginner definitions will boost accessibility.
-    The focus on combined effects rather than step-by-step descriptions should make explanations more engaging and less
-    verbose.
diff --git a/prompt_testing/prompts/v9.yaml b/prompt_testing/prompts/v9.yaml
deleted file mode 100644
index 49cecb5..0000000
--- a/prompt_testing/prompts/v9.yaml
+++ /dev/null
@@ -1,102 +0,0 @@
-name: Version 9
-description: Improved v8 with confident interpretation of empty compilation options as compiler defaults (Haiku model)
-model:
-  name: claude-3-5-haiku-20241022
-  max_tokens: 1024
-  temperature: 0.0
-audience_levels:
-  beginner:
-    description: For beginners learning assembly language. Uses simple language and explains technical terms.
-    guidance: |
-      Use simple, clear language. Define technical terms when first used.
-      Explain concepts step-by-step. Avoid overwhelming with too many details at once.
-      Use analogies where helpful to explain complex concepts.
-      When explaining register usage, explicitly mention calling conventions (e.g., 'By convention, register X is used for...').
-  experienced:
-    description: For users familiar with assembly concepts and compiler behavior. Focuses on optimizations and technical
-      details.
-    guidance: |
-      Assume familiarity with basic assembly concepts and common instructions.
-      Use technical terminology appropriately but explain advanced concepts when relevant.
-      Focus on the 'why' behind compiler choices, optimizations, and microarchitectural details.
-      Explain performance implications, trade-offs, and edge cases.
-      When analyzing assembly code, verify instruction behavior by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions. Only discuss optimization levels when clear from the code patterns.
-      When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-      Discuss performance characteristics at the CPU pipeline level when relevant.
-explanation_types:
-  assembly:
-    description: Explains the assembly instructions and their purpose.
-    focus: |
-      Focus on explaining the assembly instructions and their purpose.
-      Group related instructions together and explain their collective function.
-      Highlight important patterns like calling conventions, stack management, and control flow.
-      When explaining register usage, explicitly mention calling conventions (e.g., 'By convention, register X is used for...').
-    user_prompt_phrase: assembly output
-system_prompt: |
-  You are an expert in {arch} assembly code and {language}, helping users of the
-  Compiler Explorer website understand how their code compiles to assembly.
-  The request will be in the form of a JSON document, which explains a source program and how it was compiled,
-  and the resulting assembly code that was generated.
-
-  Target audience: {audience}
-  {audience_guidance}
-
-  For beginners: Include foundational concepts about assembly basics, register purposes, and memory organization. When function calls or parameter handling appear in the assembly, explain the calling convention patterns being used and why specific registers are chosen.
-  For experienced: Focus on optimization reasoning and architectural trade-offs. Explain not just what the compiler did, but why it made those choices and what alternatives existed. Discuss how different code patterns lead to different assembly outcomes, and provide insights that help developers write more compiler-friendly code. Include performance implications, practical considerations for real-world usage, and microarchitectural details when relevant.
-
-  Explanation type: {explanation_type}
-  {explanation_focus}
-
-  Guidelines:
-  - Compilation Options Interpretation: When analyzing assembly code, confidently interpret compiler behavior based on the compilation options provided. If compilation options are empty or contain no optimization/debug flags, this definitively means compiler defaults (unoptimized code with standard settings). State this confidently: "This is unoptimized code" - never use tentative language like "likely -O0", "appears to be", or "probably unoptimized". The absence of optimization flags is definitive information, not speculation. When explicit flags are present (like -O1, -O2, -g, -march=native), reference them directly and explain their specific effects on the assembly output.
-  - Undefined Behavior Handling: When analyzing code that contains undefined behavior (such as multiple modifications of the same variable in a single expression, data races, or other language-undefined constructs), recognize this and adjust your explanation approach. Instead of trying to definitively map assembly instructions to specific source operations, explain that the behavior is undefined and the compiler was free to implement it in any way. Describe what the compiler chose to do as "one possible implementation" or "the compiler's chosen approach" rather than claiming it's the correct or expected mapping. Focus on the educational value by explaining why such code is problematic and should be avoided, while still walking through what the generated assembly actually does.
-  - Conciseness and Focus: Keep explanations concise and focused on the most important insights. Aim for explanations that are shorter than or comparable in length to the assembly code being analyzed. In summary sections (like "Key Observations"), prioritize the most essential points rather than providing comprehensive coverage. Avoid lengthy explanations that exceed the complexity of the code itself.
-  - Confidence Calibration: Be definitive about what can be directly observed in the assembly code (instruction behavior, register usage, memory operations). Be appropriately cautious about inferring purposes, reasons, or design decisions without clear evidence. Avoid making definitive claims about efficiency, performance characteristics, or optimization strategies unless they can be clearly substantiated from the visible code patterns. When comparing to other optimization levels, only do so when directly relevant to understanding the current assembly code.
-  - Focus on the most illuminating aspects of the assembly code. Structure explanations by leading with the single most important insight or pattern first, then build supporting details around it. Ask yourself: 'What's the one thing this audience most needs to understand about this assembly?' Start there, then add context and details. Lead with the key concept or optimization pattern, then provide supporting details as needed. Use backticks around technical terms, instruction names, and specific values (e.g., `mov`, `rax`, `0x42`) to improve readability. When relevant, explain stack frame setup decisions and when compilers choose registers over stack storage. When optimization choices create notable patterns in the assembly, discuss what optimizations appear to be applied and their implications. For any code where it adds insight, compare the shown assembly with what other optimization levels (-O0, -O1, -O2, -O3) would produce, explaining specific optimizations present or missing. When showing unoptimized code, describe what optimized versions would look like and why those optimizations improve performance. When analyzing unoptimized code and it's relevant, identify missed optimization opportunities and explain what optimized assembly would look like. For optimized code, explain the specific optimizations applied and their trade-offs.
-  - Unless requested, give no commentary on the original source code itself - assume the user understands their input
-  - Reference source code only when it helps explain the assembly mapping
-  - Do not provide an overall conclusion or summary
-  - Be precise and accurate about CPU features and optimizations. Before explaining any instruction's behavior, trace through its inputs and outputs step-by-step to verify correctness. For multi-operand instructions, explicitly identify which operand is the source and which is the destination. Pay special attention to instructions like `lea` (Load Effective Address) - verify whether they perform memory access or just address calculation, as this is a common source of confusion. Double-check all register modifications and mathematical operations by working through the values. When discussing optimization patterns, describe what you observe in the code based on the compilation options provided. If compilation options indicate unoptimized code (empty options or no optimization flags), state this definitively: 'This is unoptimized code' and explain the observable characteristics. (e.g., 'single-cycle' operations) unless you can verify them for the specific architecture. Before explaining what an instruction does, carefully verify its actual behavior - trace through each instruction's inputs and outputs step by step. Qualify performance statements with appropriate caveats (e.g., 'typically', 'on most modern processors', 'depending on the specific CPU'). Double-check mathematical operations and register modifications.
-  - Avoid incorrect claims about hardware details like branch prediction
-  - When analyzing code, accurately characterize the optimization level shown. Don't claim code is 'optimal' or 'efficient' when it's clearly unoptimized. Distinguish between different optimization strategies (unrolling, tail recursion elimination, etc.) and explain the trade-offs. When showing unoptimized code, explicitly state "This is unoptimized code" without tentative qualifiers, and explain what optimizations are missing and why they would help.
-  - For mathematical operations, verify each step by tracing register values through the instruction sequence
-  - When there are notable performance trade-offs or optimization opportunities, discuss their practical impact. Explain why certain instruction choices are made (e.g., lea vs add, imul vs shift+add), discuss stack vs register storage decisions, and provide practical insights about writing compiler-friendly code when these insights would be valuable. For unoptimized code with significant performance issues, quantify the performance cost and explain what optimizations would address it.
-  - When relevant, compare the generated assembly with what other optimization levels or architectures might produce
-  - If the optimization level can be inferred from the assembly patterns and is relevant to understanding the code, mention it in context and compare with other levels when it adds insight.
-  - Weave calling convention details (parameter passing, register usage, stack vs register decisions) into the explanation where they illuminate the assembly's behavior.
-  - When discussing performance, use qualified language ('typically', 'on most processors') rather than absolute claims.
-  - When analyzing unoptimized code, explain why the compiler made seemingly inefficient choices (like unnecessary stack operations for simple functions) and what optimizations would eliminate these patterns. Help readers understand the difference between 'correct but inefficient' and 'optimized' assembly.
-  - Provide practical insights that help developers understand how to write more compiler-friendly code.
-
-
-  # Additional guidance from analysis:
-  - When analyzing assembly code, verify instruction behavior carefully by understanding inputs, operations, and outputs. Be especially careful with multi-operand instructions like imul and lea. Only make claims about optimization levels when they can be clearly determined from the code patterns.
-  - When explaining register usage patterns that might confuse the reader, clarify the roles of different registers, including parameter passing, return values, and caller/callee-saved conventions where relevant.
-  - When discussing compiler optimizations, distinguish between: constant folding, dead code elimination, register allocation, instruction selection, loop optimizations, and inlining. Explain which specific optimizations are present or absent.
-
-
-  # Additional guidance from analysis:
-  - When analyzing simple functions that use stack operations unnecessarily, explain why unoptimized compilers make these choices and what the optimized version would look like.
-  - Structure explanations to lead with key insights rather than comprehensive coverage. Ask yourself: what's the most valuable thing for this audience to understand about this assembly?
-
-
-  # Additional guidance from analysis:
-  - Use backticks around technical terms, instruction names, and specific values (e.g., `mov`, `rax`, `0x42`) to improve readability.
-  - Pay special attention to instructions like `lea` (Load Effective Address) - verify whether they perform memory access or just address calculation, as this is a common source of confusion.
-  - Structure explanations by leading with the single most important insight or pattern first, then build supporting details around it.
-
-
-  # Additional guidance from analysis:
-  - When analyzing unoptimized code, explicitly state 'This is unoptimized code' early and identify specific redundancies (like store-then-load patterns).
-  - For beginner audiences, define technical terms inline when first used (e.g., 'vectorization means processing multiple data elements simultaneously').
-user_prompt: Explain the {arch} {user_prompt_phrase}.
-assistant_prefill: "I'll analyze the {user_prompt_phrase} and explain it for {audience} level:"
-experiment_metadata:
-  base_prompt: Version 8
-  experiment_name: Confident compilation options, undefined behavior handling, conciseness and calibrated confidence
-  improvements_applied: 4
-  expected_impact: This change should eliminate tentative language about compilation options, improve handling of
-    undefined behavior cases, increase conciseness (especially in Key Observations sections), and calibrate confidence
-    appropriately - being definitive about observable facts while cautious about inferences. This should address the
-    main patterns from human reviews - verbosity issues, overconfident claims without evidence, and unnecessary
-    comparisons to unrequested optimization levels.
diff --git a/prompt_testing/reviewer.py b/prompt_testing/reviewer.py
new file mode 100644
index 0000000..4b90261
--- /dev/null
+++ b/prompt_testing/reviewer.py
@@ -0,0 +1,148 @@
+"""Correctness reviewer using a powerful model to check explanations.
+
+Uses Opus to verify factual claims in explanations generated by cheaper models.
+Instead of abstract scoring dimensions, asks specific questions about correctness.
+"""
+
+import json
+from typing import Any
+
+from anthropic import AsyncAnthropic
+
+REVIEW_SYSTEM_PROMPT = """\
+You are an expert reviewer of assembly language explanations. Your job is to \
+verify the factual correctness of explanations generated by another AI model.
+
+You will receive:
+1. Source code and compilation options
+2. The assembly output
+3. An explanation of that assembly
+
+Your task is to check the explanation for factual errors. Focus on:
+- **Instruction semantics**: Are instructions correctly described? (e.g., does \
+`lea` actually access memory, or just compute an address?)
+- **Register usage**: Are calling conventions and register purposes correct?
+- **Optimisation claims**: Are claims about what optimisations were applied accurate?
+- **Complexity/performance claims**: Are any Big-O or performance claims correct?
+- **Optimisation level characterisation**: If the code is unoptimised (no flags), \
+does the explanation say so confidently rather than hedging?
+- **Completeness**: Are important aspects of the assembly missed entirely?
+
+Respond with a JSON object (no markdown fencing):
+{
+  "correct": true/false,
+  "issues": [
+    {
+      "severity": "error" | "warning",
+      "claim": "The specific claim from the explanation",
+      "correction": "What's actually correct",
+      "location": "Brief quote from the explanation where this appears"
+    }
+  ],
+  "summary": "One-line overall assessment"
+}
+
+"error" = factually wrong (would mislead a student)
+"warning" = imprecise, misleading, or could be better but not strictly wrong
+
+If the explanation is fully correct, return {"correct": true, "issues": [], \
+"summary": "..."}."""
+
+REVIEW_USER_TEMPLATE = """\
+## Source code ({language}, compiled with {compiler} {options})
+```
+{code}
+```
+
+## Assembly ({arch})
+```
+{assembly}
+```
+
+## Explanation to review
+{explanation}"""
+
+
+class CorrectnessReviewer:
+    """Reviews explanations for factual correctness using a powerful model."""
+
+    def __init__(self, model: str = "claude-opus-4-6"):
+        self.model = model
+        self.client = AsyncAnthropic()
+
+    async def review(
+        self,
+        *,
+        language: str,
+        compiler: str,
+        options: list[str],
+        arch: str,
+        code: str,
+        assembly: str,
+        explanation: str,
+    ) -> dict[str, Any]:
+        """Review a single explanation for correctness.
+
+        Returns a dict with 'correct' (bool), 'issues' (list), 'summary' (str).
+        """
+        user_prompt = REVIEW_USER_TEMPLATE.format(
+            language=language,
+            compiler=compiler,
+            options=" ".join(options) if options else "(no flags)",
+            code=code,
+            arch=arch or "unknown",
+            assembly=assembly,
+            explanation=explanation,
+        )
+
+        msg = await self.client.messages.create(
+            model=self.model,
+            max_tokens=2048,
+            temperature=0.0,
+            system=REVIEW_SYSTEM_PROMPT,
+            messages=[{"role": "user", "content": user_prompt}],
+        )
+
+        text = msg.content[0].text.strip()
+
+        # Parse JSON response
+        try:
+            result = json.loads(text)
+        except json.JSONDecodeError:
+            # Try to extract JSON from markdown fencing
+            if "```" in text:
+                json_part = text.split("```")[1]
+                if json_part.startswith("json"):
+                    json_part = json_part[4:]
+                result = json.loads(json_part.strip())
+            else:
+                result = {
+                    "correct": None,
+                    "issues": [],
+                    "summary": f"Failed to parse reviewer response: {text[:200]}",
+                }
+
+        result["reviewer_model"] = self.model
+        result["reviewer_input_tokens"] = msg.usage.input_tokens
+        result["reviewer_output_tokens"] = msg.usage.output_tokens
+
+        return result
+
+    async def review_test_result(
+        self,
+        test_case: dict[str, Any],
+        explanation: str,
+    ) -> dict[str, Any]:
+        """Review a test result using the test case data."""
+        inp = test_case["input"]
+        asm_text = "\n".join(a["text"] for a in inp["asm"] if isinstance(a, dict) and "text" in a)
+
+        return await self.review(
+            language=inp.get("language", "unknown"),
+            compiler=inp.get("compiler", "unknown"),
+            options=inp.get("compilationOptions", []),
+            arch=inp.get("instructionSet", "unknown"),
+            code=inp.get("code", ""),
+            assembly=asm_text,
+            explanation=explanation,
+        )
diff --git a/prompt_testing/runner.py b/prompt_testing/runner.py
index 7c8b98b..15ea143 100644
--- a/prompt_testing/runner.py
+++ b/prompt_testing/runner.py
@@ -1,5 +1,7 @@
-"""
-Main test runner for prompt evaluation.
+"""Simple test runner for prompt evaluation.
+
+Runs test cases against the explain API and saves outputs for human review.
+No automated scoring — the human reads the output and decides if it's good.
 """
 
 import asyncio
@@ -8,350 +10,148 @@
 from pathlib import Path
 from typing import Any
 
-from anthropic import Anthropic, AsyncAnthropic
+from anthropic import AsyncAnthropic
 from dotenv import load_dotenv
 
 from app.explain_api import AssemblyItem, ExplainRequest
 from app.explanation_types import AudienceLevel, ExplanationType
-from app.metrics import NoopMetricsProvider
 from app.prompt import Prompt
-from prompt_testing.evaluation.claude_reviewer import ClaudeReviewer
-from prompt_testing.evaluation.scorer import load_all_test_cases
-from prompt_testing.file_utils import load_prompt_file, save_json_results
+from prompt_testing.file_utils import load_all_test_cases
 
-# Load environment variables from .env file
 load_dotenv()
 
 
 class PromptTester:
-    """Main class for running prompt tests."""
+    """Runs test cases against a prompt and collects outputs."""
 
     def __init__(
         self,
-        project_root: str,
-        anthropic_api_key: str | None = None,
-        reviewer_model: str = "claude-sonnet-4-0",
-        max_concurrent_requests: int = 5,
+        project_root: str | Path,
+        max_concurrent: int = 5,
     ):
         self.project_root = Path(project_root)
-        self.prompt_dir = self.project_root / "prompt_testing" / "prompts"
         self.test_cases_dir = self.project_root / "prompt_testing" / "test_cases"
         self.results_dir = self.project_root / "prompt_testing" / "results"
+        self.results_dir.mkdir(parents=True, exist_ok=True)
+        self.async_client = AsyncAnthropic()
+        self.semaphore = asyncio.Semaphore(max_concurrent)
 
-        # Initialize Anthropic clients (both sync and async)
-        if anthropic_api_key:
-            self.client = Anthropic(api_key=anthropic_api_key)
-            self.async_client = AsyncAnthropic(api_key=anthropic_api_key)
-        else:
-            self.client = Anthropic()  # Will use ANTHROPIC_API_KEY env var
-            self.async_client = AsyncAnthropic()
-
-        # Initialize Claude reviewer
-        self.scorer = ClaudeReviewer(anthropic_api_key=anthropic_api_key, reviewer_model=reviewer_model)
-
-        self.metrics_provider = NoopMetricsProvider()  # Use noop provider for testing
-
-        # Rate limiting
-        self.max_concurrent_requests = max_concurrent_requests
-        self.semaphore = asyncio.Semaphore(max_concurrent_requests)
-
-    def load_prompt(self, prompt_version: str) -> dict[str, Any]:
-        """Load a prompt configuration from YAML file.
-
-        Special case: 'current' loads from app/prompt.yaml
-        """
+    def load_prompt(self, prompt_version: str) -> Prompt:
+        """Load a prompt. 'current' loads from app/prompt.yaml."""
         if prompt_version == "current":
-            # Load the current production prompt from the app directory
-            prompt_file = self.project_root / "app" / "prompt.yaml"
+            path = self.project_root / "app" / "prompt.yaml"
         else:
-            prompt_file = self.prompt_dir / f"{prompt_version}.yaml"
-
-        return load_prompt_file(prompt_file)
-
-    def convert_test_case_to_request(self, test_case: dict[str, Any]) -> ExplainRequest:
-        """Convert a test case to an ExplainRequest object."""
-        input_data = test_case["input"]
+            path = self.project_root / "prompt_testing" / "prompts" / f"{prompt_version}.yaml"
+        return Prompt(path)
 
-        # Convert assembly to AssemblyItem objects
-        asm_items = []
-        for asm_dict in input_data["asm"]:
-            asm_items.append(AssemblyItem(**asm_dict))
-
-        # Get audience and explanation type from test case, with defaults
+    @staticmethod
+    def _to_request(test_case: dict[str, Any]) -> ExplainRequest:
+        """Convert a test case dict to an ExplainRequest."""
+        inp = test_case["input"]
         audience = test_case.get("audience", "beginner")
-        if isinstance(audience, str):
-            audience = AudienceLevel(audience)
-
         explanation = test_case.get("explanation_type", "assembly")
-        if isinstance(explanation, str):
-            explanation = ExplanationType(explanation)
-
         return ExplainRequest(
-            language=input_data["language"],
-            compiler=input_data["compiler"],
-            compilationOptions=input_data.get("compilationOptions", []),
-            instructionSet=input_data.get("instructionSet"),
-            code=input_data["code"],
-            asm=asm_items,
-            labelDefinitions=input_data.get("labelDefinitions", {}),
-            audience=audience,
-            explanation=explanation,
+            language=inp["language"],
+            compiler=inp["compiler"],
+            compilationOptions=inp.get("compilationOptions", []),
+            instructionSet=inp.get("instructionSet"),
+            code=inp["code"],
+            asm=[AssemblyItem(**a) for a in inp["asm"]],
+            labelDefinitions=inp.get("labelDefinitions", {}),
+            audience=AudienceLevel(audience) if isinstance(audience, str) else audience,
+            explanation=ExplanationType(explanation) if isinstance(explanation, str) else explanation,
         )
 
-    def _format_assembly(self, asm_items: list[AssemblyItem]) -> str:
-        """Format assembly items into a readable string for Claude review."""
-        lines = []
-        for item in asm_items:
-            if item.text and item.text.strip():
-                lines.append(item.text)
-        return "\n".join(lines)
-
-    async def run_single_test_async(
-        self, test_case: dict[str, Any], prompt_version: str, model: str | None = None, max_tokens: int | None = None
-    ) -> dict[str, Any]:
-        """Run a single test case with the specified prompt asynchronously."""
-        async with self.semaphore:  # Rate limiting
+    async def _run_one(self, test_case: dict[str, Any], prompt: Prompt) -> dict[str, Any]:
+        """Run a single test case."""
+        async with self.semaphore:
             case_id = test_case["id"]
-            print(f"Running test case: {case_id} with prompt: {prompt_version}")
-
-        # Load prompt configuration and create Prompt instance
-        prompt_config = self.load_prompt(prompt_version)
-        prompt = Prompt(prompt_config)
-
-        # Convert test case to request
-        request = self.convert_test_case_to_request(test_case)
-
-        # Generate messages using Prompt instance
-        prompt_data = prompt.generate_messages(request)
-
-        # Use override model/max_tokens if provided, otherwise use prompt defaults
-        if model is not None:
-            prompt_data["model"] = model
-        if max_tokens is not None:
-            prompt_data["max_tokens"] = max_tokens
-
-        # Call Claude API asynchronously
-        start_time = time.time()
-        message = await self.async_client.messages.create(
-            model=prompt_data["model"],
-            max_tokens=prompt_data["max_tokens"],
-            temperature=prompt_data["temperature"],
-            system=prompt_data["system"],
-            messages=prompt_data["messages"],
-        )
-
-        response_time_ms = int((time.time() - start_time) * 1000)
-        explanation = message.content[0].text.strip()
-
-        # Extract token usage
-        input_tokens = message.usage.input_tokens
-        output_tokens = message.usage.output_tokens
-
-        success = True
-        error = None
-
-        # Evaluate response
-        if success:
-            # Use Claude reviewer for evaluation (async version)
-            metrics = await self.scorer.evaluate_response_async(
-                source_code=request.code,
-                assembly_code=self._format_assembly(request.asm),
-                audience=request.audience,
-                explanation_type=request.explanation,
-                explanation=explanation,
-                test_case=test_case,
-                token_count=output_tokens,
-                response_time_ms=response_time_ms,
-            )
-        else:
-            metrics = None
-
-        return {
-            "case_id": case_id,
-            "prompt_version": prompt_version,
-            "model": prompt_data["model"],
-            "success": success,
-            "error": error,
-            "response": explanation,
-            "input_tokens": input_tokens,
-            "output_tokens": output_tokens,
-            "response_time_ms": response_time_ms,
-            "metrics": metrics.__dict__ if metrics else None,
-            "timestamp": datetime.now().isoformat(),
-        }
-
-    def run_single_test(
-        self, test_case: dict[str, Any], prompt_version: str, model: str | None = None, max_tokens: int | None = None
-    ) -> dict[str, Any]:
-        """Synchronous wrapper for backward compatibility."""
-        return asyncio.run(self.run_single_test_async(test_case, prompt_version, model, max_tokens))
+            request = self._to_request(test_case)
+            prompt_data = prompt.generate_messages(request)
+
+            start = time.time()
+            try:
+                msg = await self.async_client.messages.create(
+                    model=prompt_data["model"],
+                    max_tokens=prompt_data["max_tokens"],
+                    temperature=prompt_data["temperature"],
+                    system=prompt_data["system"],
+                    messages=prompt_data["messages"],
+                )
+                elapsed_ms = int((time.time() - start) * 1000)
+                return {
+                    "case_id": case_id,
+                    "success": True,
+                    "explanation": msg.content[-1].text.strip(),
+                    "model": prompt_data["model"],
+                    "input_tokens": msg.usage.input_tokens,
+                    "output_tokens": msg.usage.output_tokens,
+                    "elapsed_ms": elapsed_ms,
+                }
+            except Exception as e:
+                return {
+                    "case_id": case_id,
+                    "success": False,
+                    "error": str(e),
+                }
 
-    async def run_test_suite_async(
+    async def run_async(
         self,
-        prompt_version: str,
-        test_cases: list[str] | None = None,
+        prompt_version: str = "current",
+        case_ids: list[str] | None = None,
         categories: list[str] | None = None,
-        audience: str | None = None,
-        explanation_type: str | None = None,
     ) -> dict[str, Any]:
-        """Run a full test suite with the specified prompt version asynchronously."""
-
-        # Load all test cases
-        all_cases = load_all_test_cases(str(self.test_cases_dir))
-
-        # Filter test cases if specified
-        if test_cases:
-            all_cases = [case for case in all_cases if case["id"] in test_cases]
+        """Run test cases and return results."""
+        prompt = self.load_prompt(prompt_version)
+        cases = load_all_test_cases(str(self.test_cases_dir))
 
+        if case_ids:
+            cases = [c for c in cases if c["id"] in case_ids]
         if categories:
-            all_cases = [case for case in all_cases if case["category"] in categories]
-
-        if audience:
-            all_cases = [case for case in all_cases if case.get("audience") == audience]
-
-        if explanation_type:
-            all_cases = [case for case in all_cases if case.get("explanation_type") == explanation_type]
-
-        if not all_cases:
-            raise ValueError("No test cases found matching the specified criteria")
-
-        print(
-            f"Running {len(all_cases)} test cases with prompt version: {prompt_version} "
-            f"(max {self.max_concurrent_requests} concurrent)"
-        )
+            cases = [c for c in cases if c.get("category") in categories]
+        if not cases:
+            raise ValueError("No test cases matched filters")
 
-        # Create tasks for all test cases
-        tasks = [self.run_single_test_async(case, prompt_version) for case in all_cases]
+        print(f"Running {len(cases)} test cases with prompt: {prompt_version}")
 
-        # Run tasks concurrently with progress tracking
+        tasks = [self._run_one(c, prompt) for c in cases]
         results = []
-
-        # Use asyncio.as_completed to get results as they finish
-        for completed, coro in enumerate(asyncio.as_completed(tasks), 1):
+        for i, coro in enumerate(asyncio.as_completed(tasks), 1):
             result = await coro
             results.append(result)
+            status = "✓" if result["success"] else "✗"
+            tokens = f"in={result.get('input_tokens', '?')} out={result.get('output_tokens', '?')}"
+            print(f"  [{i}/{len(cases)}] {status} {result['case_id']} ({tokens})")
+
+        successful = [r for r in results if r["success"]]
+        total_cost = sum(
+            r["input_tokens"] * 3 / 1e6 + r["output_tokens"] * 15 / 1e6  # Sonnet pricing
+            for r in successful
+        )
 
-            # Print progress
-            if result["success"]:
-                metrics = result["metrics"]
-                if metrics:
-                    print(f"  [{completed}/{len(all_cases)}] ✓ {result['case_id']}: {metrics['overall_score']:.2f}")
-                else:
-                    print(f"  [{completed}/{len(all_cases)}] ✓ {result['case_id']}: completed")
-            else:
-                print(f"  [{completed}/{len(all_cases)}] ✗ {result['case_id']}: {result['error']}")
-
-        # Calculate summary statistics
-        successful_results = [r for r in results if r["success"]]
-        summary = {
+        return {
             "prompt_version": prompt_version,
-            "total_cases": len(results),
-            "successful_cases": len(successful_results),
-            "failed_cases": len(results) - len(successful_results),
-            "success_rate": len(successful_results) / len(results) if results else 0,
+            "model": prompt.model,
             "timestamp": datetime.now().isoformat(),
+            "total_cases": len(results),
+            "successful": len(successful),
+            "failed": len(results) - len(successful),
+            "total_cost_usd": round(total_cost, 6),
+            "results": results,
         }
 
-        if successful_results:
-            # Calculate average metrics
-            all_metrics = [r["metrics"] for r in successful_results if r["metrics"]]
-            if all_metrics:
-                summary["average_metrics"] = {
-                    "overall_score": sum(m["overall_score"] for m in all_metrics) / len(all_metrics),
-                    "accuracy": sum(m["accuracy"] for m in all_metrics) / len(all_metrics),
-                    "relevance": sum(m["relevance"] for m in all_metrics) / len(all_metrics),
-                    "conciseness": sum(m["conciseness"] for m in all_metrics) / len(all_metrics),
-                    "insight": sum(m["insight"] for m in all_metrics) / len(all_metrics),
-                    "appropriateness": sum(m["appropriateness"] for m in all_metrics) / len(all_metrics),
-                    "average_tokens": sum(m["token_count"] for m in all_metrics) / len(all_metrics),
-                    "average_response_time": sum(m["response_time_ms"] or 0 for m in all_metrics) / len(all_metrics),
-                }
-
-        return {"summary": summary, "results": results}
-
-    def run_test_suite(
-        self,
-        prompt_version: str,
-        test_cases: list[str] | None = None,
-        categories: list[str] | None = None,
-        audience: str | None = None,
-        explanation_type: str | None = None,
-    ) -> dict[str, Any]:
-        """Synchronous wrapper for backward compatibility."""
-        return asyncio.run(
-            self.run_test_suite_async(prompt_version, test_cases, categories, audience, explanation_type)
-        )
-
-    def save_results(self, test_results: dict[str, Any], output_file: str | None = None) -> str:
-        """Save test results to a timestamped file."""
-        if not output_file:
-            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
-            prompt_version = test_results["summary"]["prompt_version"]
-            output_file = f"{timestamp}_{prompt_version}.json"
-
-        output_path = self.results_dir / output_file
-        save_json_results(test_results, output_path)
-        print(f"Results saved to: {output_path}")
-        return str(output_path)
-
-    async def compare_prompt_versions_async(
-        self, version1: str, version2: str, test_cases: list[str] | None = None
-    ) -> dict[str, Any]:
-        """Compare two prompt versions on the same test cases asynchronously."""
-
-        print(f"Comparing prompt versions: {version1} vs {version2}")
-
-        # Run both test suites concurrently
-        results1, results2 = await asyncio.gather(
-            self.run_test_suite_async(version1, test_cases), self.run_test_suite_async(version2, test_cases)
-        )
-
-        # Create comparison
-        comparison = {
-            "version1": version1,
-            "version2": version2,
-            "summary1": results1["summary"],
-            "summary2": results2["summary"],
-            "case_comparisons": [],
-        }
-
-        # Compare individual cases
-        results1_by_id = {r["case_id"]: r for r in results1["results"]}
-        results2_by_id = {r["case_id"]: r for r in results2["results"]}
-
-        for case_id in set(results1_by_id.keys()) & set(results2_by_id.keys()):
-            r1 = results1_by_id[case_id]
-            r2 = results2_by_id[case_id]
-
-            case_comparison = {
-                "case_id": case_id,
-                "version1_success": r1["success"],
-                "version2_success": r2["success"],
-            }
-
-            if r1["success"] and r2["success"] and r1["metrics"] and r2["metrics"]:
-                m1 = r1["metrics"]
-                m2 = r2["metrics"]
-                case_comparison.update(
-                    {
-                        "score_difference": m2["overall_score"] - m1["overall_score"],
-                        "version1_score": m1["overall_score"],
-                        "version2_score": m2["overall_score"],
-                        "better_version": version2 if m2["overall_score"] > m1["overall_score"] else version1,
-                        "accuracy_difference": m2["accuracy"] - m1["accuracy"],
-                        "relevance_difference": m2["relevance"] - m1["relevance"],
-                        "conciseness_difference": m2["conciseness"] - m1["conciseness"],
-                        "insight_difference": m2["insight"] - m1["insight"],
-                        "appropriateness_difference": m2["appropriateness"] - m1["appropriateness"],
-                    }
-                )
-
-            comparison["case_comparisons"].append(case_comparison)
-
-        return comparison
-
-    def compare_prompt_versions(
-        self, version1: str, version2: str, test_cases: list[str] | None = None
-    ) -> dict[str, Any]:
-        """Synchronous wrapper for backward compatibility."""
-        return asyncio.run(self.compare_prompt_versions_async(version1, version2, test_cases))
+    def run(self, **kwargs) -> dict[str, Any]:
+        """Synchronous wrapper."""
+        return asyncio.run(self.run_async(**kwargs))
+
+    def save(self, data: dict[str, Any], filename: str | None = None) -> Path:
+        """Save results to JSON."""
+        import json
+
+        if not filename:
+            ts = datetime.now().strftime("%Y%m%d_%H%M%S")
+            filename = f"{ts}_{data['prompt_version']}.json"
+        path = self.results_dir / filename
+        path.write_text(json.dumps(data, indent=2))
+        print(f"Saved to {path}")
+        return path
diff --git a/prompt_testing/templates/index.html b/prompt_testing/templates/index.html
deleted file mode 100644
index 0a11fb3..0000000
--- a/prompt_testing/templates/index.html
+++ /dev/null
@@ -1,137 +0,0 @@
-<!DOCTYPE html>
-<html lang="en">
-<head>
-    <meta charset="UTF-8">
-    <meta name="viewport" content="width=device-width, initial-scale=1.0">
-    <title>Prompt Review Interface</title>
-    <style>
-        * { box-sizing: border-box; }
-        body {
-            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
-            margin: 0; padding: 20px; background: #f5f5f5;
-        }
-        .container { max-width: 1200px; margin: 0 auto; }
-        .header {
-            background: white; padding: 20px; border-radius: 8px;
-            margin-bottom: 20px; box-shadow: 0 2px 4px rgba(0,0,0,0.1);
-        }
-        .results-grid {
-            display: grid;
-            grid-template-columns: repeat(auto-fill, minmax(300px, 1fr));
-            gap: 20px;
-        }
-        .result-card {
-            background: white; padding: 20px; border-radius: 8px;
-            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
-            transition: transform 0.2s ease, box-shadow 0.2s ease;
-        }
-        .result-card:hover {
-            transform: translateY(-2px);
-            box-shadow: 0 4px 12px rgba(0,0,0,0.15);
-        }
-        .result-header {
-            display: flex; justify-content: space-between;
-            align-items: center; margin-bottom: 15px;
-        }
-        .prompt-version { font-weight: 600; font-size: 1.1em; color: #333; }
-        .timestamp { color: #666; font-size: 0.9em; }
-        .metrics { display: grid; grid-template-columns: 1fr 1fr; gap: 10px; margin-bottom: 15px; }
-        .metric { text-align: center; }
-        .metric-value { font-size: 1.4em; font-weight: 600; color: #2563eb; }
-        .metric-label { font-size: 0.85em; color: #666; }
-        .review-btn {
-            width: 100%; padding: 10px; background: #2563eb; color: white;
-            border: none; border-radius: 6px; cursor: pointer; font-weight: 500;
-            transition: background 0.2s ease;
-        }
-        .review-btn:hover { background: #1d4ed8; }
-        .loading { text-align: center; padding: 40px; color: #666; }
-        .empty-state { text-align: center; padding: 60px 20px; color: #666; }
-        .empty-state h3 { margin-bottom: 10px; }
-    </style>
-</head>
-<body>
-    <div class="container">
-        <div class="header">
-            <h1>🧪 Prompt Review Interface</h1>
-            <p>Select a test result file to review and rate prompt responses.</p>
-        </div>
-
-        <div id="loading" class="loading">
-            <p>Loading test results...</p>
-        </div>
-
-        <div id="results" class="results-grid" style="display: none;"></div>
-
-        <div id="empty" class="empty-state" style="display: none;">
-            <h3>No test results found</h3>
-            <p>Run some tests first using: <code>uv run prompt-test run --prompt [version]</code></p>
-        </div>
-    </div>
-
-    <script>
-        async function loadResults() {
-            try {
-                const response = await fetch('/api/results');
-                const data = await response.json();
-
-                document.getElementById('loading').style.display = 'none';
-
-                if (data.results.length === 0) {
-                    document.getElementById('empty').style.display = 'block';
-                    return;
-                }
-
-                const resultsContainer = document.getElementById('results');
-                resultsContainer.style.display = 'grid';
-
-                data.results.forEach(result => {
-                    const card = createResultCard(result);
-                    resultsContainer.appendChild(card);
-                });
-
-            } catch (error) {
-                document.getElementById('loading').innerHTML =
-                    '<p style="color: red;">Error loading results: ' + error.message + '</p>';
-            }
-        }
-
-        function createResultCard(result) {
-            const card = document.createElement('div');
-            card.className = 'result-card';
-
-            const successRate = (result.success_rate * 100).toFixed(0);
-            const avgScore = result.average_score ? result.average_score.toFixed(2) : 'N/A';
-
-            card.innerHTML = `
-                <div class="result-header">
-                    <div class="prompt-version">${result.description}</div>
-                    <div class="timestamp">${result.timestamp}</div>
-                </div>
-                <div class="metrics">
-                    <div class="metric">
-                        <div class="metric-value">${successRate}%</div>
-                        <div class="metric-label">Success Rate</div>
-                    </div>
-                    <div class="metric">
-                        <div class="metric-value">${avgScore}</div>
-                        <div class="metric-label">Avg Score</div>
-                    </div>
-                </div>
-                <button class="review-btn" onclick="openReview('${result.file}')">
-                    Review Results (${result.total_cases} cases)
-                </button>
-            `;
-
-            return card;
-        }
-
-        function openReview(filename) {
-            window.open(`/review/${filename}`, '_blank');
-        }
-
-        // Load results on page load
-        loadResults();
-    </script>
-</body>
-</html>
diff --git a/prompt_testing/templates/review.html b/prompt_testing/templates/review.html
deleted file mode 100644
index 6432e3f..0000000
--- a/prompt_testing/templates/review.html
+++ /dev/null
@@ -1,617 +0,0 @@
-<!DOCTYPE html>
-<html lang="en">
-<head>
-    <meta charset="UTF-8">
-    <meta name="viewport" content="width=device-width, initial-scale=1.0">
-    <title>Review: {{ data.summary.prompt_version if data.summary else filename }}</title>
-    <style>
-        * { box-sizing: border-box; }
-        body {
-            font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
-            margin: 0; padding: 20px; background: #f5f5f5; line-height: 1.6;
-        }
-        .container { max-width: 1000px; margin: 0 auto; }
-        .header {
-            background: white; padding: 20px; border-radius: 8px;
-            margin-bottom: 20px; box-shadow: 0 2px 4px rgba(0,0,0,0.1);
-        }
-        .case-card {
-            background: white; margin-bottom: 30px; border-radius: 8px;
-            box-shadow: 0 2px 4px rgba(0,0,0,0.1); overflow: hidden;
-            position: relative;
-        }
-        .case-card.reviewed {
-            border-left: 5px solid #16a34a;
-        }
-        .case-card.not-reviewed {
-            border-left: 5px solid #dc2626;
-        }
-        .case-header {
-            background: #f8f9fa; padding: 15px 20px;
-            border-bottom: 1px solid #e9ecef;
-            display: flex; justify-content: space-between; align-items: center;
-        }
-        .case-id { font-weight: 600; color: #333; }
-        .case-meta { font-size: 0.9em; color: #666; }
-        .review-status {
-            display: inline-flex; align-items: center; gap: 8px;
-            padding: 4px 12px; border-radius: 16px; font-size: 0.8em; font-weight: 500;
-        }
-        .review-status.reviewed {
-            background: #dcfce7; color: #166534;
-        }
-        .review-status.not-reviewed {
-            background: #fef2f2; color: #dc2626;
-        }
-        .progress-bar {
-            background: #f3f4f6; border-radius: 8px; height: 8px; margin-top: 10px;
-            overflow: hidden;
-        }
-        .progress-fill {
-            background: linear-gradient(90deg, #16a34a, #22c55e);
-            height: 100%; transition: width 0.3s ease;
-        }
-        .case-content { padding: 20px; }
-        .section { margin-bottom: 25px; }
-        .section h4 {
-            margin: 0 0 10px 0; color: #333; font-size: 0.95em;
-            text-transform: uppercase; letter-spacing: 0.5px;
-        }
-        .code-block {
-            background: #f8f9fa; padding: 15px; border-radius: 6px;
-            font-family: 'Monaco', 'Consolas', monospace;
-            font-size: 0.9em; overflow-x: auto; border-left: 4px solid #007acc;
-            white-space: pre-wrap;
-        }
-        .response-text {
-            background: #f0f9ff; padding: 15px; border-radius: 6px;
-            border-left: 4px solid #0ea5e9; line-height: 1.6;
-        }
-        .response-text pre {
-            background: #f8f9fa; padding: 10px; border-radius: 4px;
-            overflow-x: auto; font-size: 0.9em; margin: 10px 0;
-            line-height: 1.4;
-        }
-        .response-text :not(pre) > code {
-            background: #f1f5f9; padding: 2px 4px; border-radius: 3px;
-            font-family: 'Monaco', 'Consolas', monospace; font-size: 0.9em;
-        }
-        .response-text pre code {
-            background: none; padding: 0; border-radius: 0;
-            font-family: 'Monaco', 'Consolas', monospace; font-size: inherit;
-            display: block;
-        }
-        .response-text h1, .response-text h2, .response-text h3 {
-            margin-top: 20px; margin-bottom: 10px; color: #1e40af;
-        }
-        .response-text ul, .response-text ol {
-            margin: 10px 0; padding-left: 20px;
-        }
-        .response-text blockquote {
-            border-left: 3px solid #ddd; margin: 10px 0; padding-left: 15px;
-            color: #666; font-style: italic;
-        }
-        /* Basic syntax highlighting for assembly code */
-        .response-text .language-assembly {
-            background: #282c34; color: #abb2bf; padding: 15px; border-radius: 6px;
-            font-family: 'Monaco', 'Consolas', monospace; font-size: 0.85em;
-        }
-        .response-text .language-c, .response-text .language-cpp {
-            background: #f8f8f2; color: #272822; padding: 15px; border-radius: 6px;
-            font-family: 'Monaco', 'Consolas', monospace; font-size: 0.85em;
-        }
-        .response-text strong {
-            font-weight: 600; color: #1e40af;
-        }
-        .response-text em {
-            font-style: italic; color: #6366f1;
-        }
-        .metrics {
-            display: grid;
-            grid-template-columns: repeat(auto-fit, minmax(120px, 1fr));
-            gap: 15px; background: #f8f9fa; padding: 15px; border-radius: 6px;
-        }
-        .metrics-secondary {
-            display: grid;
-            grid-template-columns: repeat(auto-fit, minmax(120px, 1fr));
-            gap: 15px; background: #f1f3f4; padding: 15px; border-radius: 6px;
-            margin-top: 10px; border-top: 1px solid #e0e0e0;
-        }
-        .metric { text-align: center; }
-        .metric-value { font-size: 1.2em; font-weight: 600; color: #2563eb; }
-        .metric-label { font-size: 0.85em; color: #666; }
-        .review-form {
-            background: #fefce8; border: 1px solid #fde047;
-            padding: 20px; border-radius: 6px; margin-top: 20px;
-        }
-        .form-row {
-            display: grid;
-            grid-template-columns: repeat(auto-fit, minmax(150px, 1fr));
-            gap: 15px; margin-bottom: 15px;
-        }
-        .form-group { display: flex; flex-direction: column; }
-        .form-group label { font-weight: 500; margin-bottom: 5px; color: #333; }
-        .form-group input, .form-group textarea, .form-group select {
-            padding: 8px 12px; border: 1px solid #d1d5db;
-            border-radius: 4px; font-size: 0.9em;
-        }
-        .form-group textarea { resize: vertical; min-height: 60px; }
-        .save-btn {
-            background: #16a34a; color: white; border: none;
-            padding: 10px 20px; border-radius: 6px; cursor: pointer;
-            font-weight: 500; transition: background 0.2s ease;
-        }
-        .save-btn:hover { background: #15803d; }
-        .save-btn:disabled { background: #9ca3af; cursor: not-allowed; }
-        .success-msg { color: #16a34a; font-weight: 500; margin-left: 10px; }
-        .error-msg { color: #dc2626; font-weight: 500; margin-left: 10px; }
-        .reviewer-info {
-            background: #ede9fe; padding: 15px; border-radius: 6px;
-            margin-bottom: 20px; border-left: 4px solid #8b5cf6;
-        }
-        /* Tooltip styling */
-        label[title], input[title] {
-            cursor: help;
-        }
-        label[title]:hover {
-            color: #2563eb;
-        }
-        /* Side-by-side code layout */
-        .code-container {
-            display: grid;
-            grid-template-columns: 1fr 1fr;
-            gap: 20px;
-        }
-        .code-panel h5 {
-            margin: 0 0 10px 0;
-            font-size: 0.9em;
-            color: #555;
-            font-weight: 600;
-        }
-        /* Stack on mobile */
-        @media (max-width: 768px) {
-            .code-container {
-                grid-template-columns: 1fr;
-                gap: 15px;
-            }
-        }
-    </style>
-</head>
-<body>
-    <div class="container">
-        <div class="header">
-            <h1>📝 Review: {{ data.summary.prompt_version if data.summary else 'Unknown' }}</h1>
-            <p><strong>File:</strong> {{ filename }}</p>
-            {% if data.summary %}
-            <p><strong>Success Rate:</strong>
-               {{ "%.1f"|format(data.summary.success_rate * 100) }}%
-               ({{ data.summary.successful_cases }}/{{ data.summary.total_cases }} cases)</p>
-            {% if data.summary.average_metrics %}
-            <p><strong>Average Score:</strong>
-               {{ "%.2f"|format(data.summary.average_metrics.overall_score) }}</p>
-            {% endif %}
-            {% endif %}
-
-            <div id="review-progress" style="display: none;">
-                <p><strong>Review Progress:</strong> <span id="progress-text">Loading...</span></p>
-                <div class="progress-bar">
-                    <div class="progress-fill" id="progress-fill" style="width: 0%"></div>
-                </div>
-            </div>
-        </div>
-
-        <div class="reviewer-info">
-            <h4>👤 Reviewer Information</h4>
-            <div class="form-row">
-                <div class="form-group">
-                    <label for="reviewer-name">Your Name/ID:</label>
-                    <input type="text" id="reviewer-name" placeholder="e.g., john_doe" required>
-                </div>
-            </div>
-            <p><strong>📊 Quality Metrics:</strong> Rate responses using our 5 quality dimensions.
-               Hover over metric labels for detailed descriptions.</p>
-        </div>
-
-        {% if data.results %}
-        {% for result in data.results %}
-        {% if result.success %}
-        <div class="case-card">
-            <div class="case-header">
-                <div>
-                    <div class="case-id">{{ result.case_id }}</div>
-                    <div class="case-meta">
-                        {{ result.test_case.language }} | {{ result.test_case.compiler }} |
-                        Audience: {{ result.test_case.audience }} |
-                        Type: {{ result.test_case.explanation_type }}
-                    </div>
-                </div>
-                <div class="review-status not-reviewed" id="status-{{ result.case_id }}">
-                    <span>⚪</span> Not Reviewed
-                </div>
-            </div>
-
-            <div class="case-content">
-                <div class="section">
-                    <h4>💻 Code & Assembly</h4>
-                    <div class="code-container">
-                        <div class="code-panel">
-                            <h5>💾 Source Code</h5>
-                            <div class="code-block">{{ result.test_case.source_code }}</div>
-                        </div>
-                        <div class="code-panel">
-                            <h5>⚙️ Assembly Output</h5>
-                            <div class="code-block">{% if result.test_case.assembly %}{% for asm_item in result.test_case.assembly %}{{ asm_item.text }}
-{% endfor %}{% else %}Assembly not available{% endif %}</div>
-                        </div>
-                    </div>
-                </div>
-
-                <div class="section">
-                    <h4>🤖 AI Response</h4>
-                    <div class="response-text" id="response-{{ result.case_id }}">
-                        {{ result.response }}
-                    </div>
-                </div>
-
-                {% if result.metrics %}
-                <div class="section">
-                    <h4>📊 Automatic Metrics</h4>
-                    <div class="metrics">
-                        <div class="metric">
-                            <div class="metric-value">
-                                {{ "%.1f"|format(result.metrics.accuracy * 5) }}
-                            </div>
-                            <div class="metric-label">Accuracy</div>
-                        </div>
-                        <div class="metric">
-                            <div class="metric-value">
-                                {{ "%.1f"|format(result.metrics.relevance * 5) }}
-                            </div>
-                            <div class="metric-label">Relevance</div>
-                        </div>
-                        <div class="metric">
-                            <div class="metric-value">
-                                {{ "%.1f"|format(result.metrics.conciseness * 5) }}
-                            </div>
-                            <div class="metric-label">Conciseness</div>
-                        </div>
-                        <div class="metric">
-                            <div class="metric-value">
-                                {{ "%.1f"|format(result.metrics.insight * 5) }}
-                            </div>
-                            <div class="metric-label">Insight</div>
-                        </div>
-                        <div class="metric">
-                            <div class="metric-value">
-                                {{ "%.1f"|format(result.metrics.appropriateness * 5) }}
-                            </div>
-                            <div class="metric-label">Appropriateness</div>
-                        </div>
-                    </div>
-                    <div class="metrics-secondary">
-                        <div class="metric">
-                            <div class="metric-value">
-                                {{ "%.2f"|format(result.metrics.overall_score) }}
-                            </div>
-                            <div class="metric-label">Overall</div>
-                        </div>
-                        <div class="metric">
-                            <div class="metric-value">{{ result.metrics.token_count }}</div>
-                            <div class="metric-label">Tokens</div>
-                        </div>
-                        <div class="metric">
-                            <div class="metric-value">{{ result.metrics.response_time_ms }}</div>
-                            <div class="metric-label">Time (ms)</div>
-                        </div>
-                    </div>
-                </div>
-                {% endif %}
-
-                <div class="review-form">
-                    <h4>✏️ Your Review</h4>
-                    <div class="form-row">
-                        <div class="form-group">
-                            <label title="Technical correctness without false claims">Accuracy (1-5):</label>
-                            <input type="number" min="1" max="5" class="score-input"
-                                   name="accuracy" data-case="{{ result.case_id }}"
-                                   title="Technical correctness without false claims">
-                        </div>
-                        <div class="form-group">
-                            <label title="Discusses actual code, recognizes optimization level">Relevance (1-5):</label>
-                            <input type="number" min="1" max="5" class="score-input"
-                                   name="relevance" data-case="{{ result.case_id }}"
-                                   title="Discusses actual code, recognizes optimization level">
-                        </div>
-                        <div class="form-group">
-                            <label title="Direct explanation without filler or boilerplate">Conciseness (1-5):</label>
-                            <input type="number" min="1" max="5" class="score-input"
-                                   name="conciseness" data-case="{{ result.case_id }}"
-                                   title="Direct explanation without filler or boilerplate">
-                        </div>
-                        <div class="form-group">
-                            <label title="Explains WHY and provides actionable understanding">Insight (1-5):</label>
-                            <input type="number" min="1" max="5" class="score-input"
-                                   name="insight" data-case="{{ result.case_id }}"
-                                   title="Explains WHY and provides actionable understanding">
-                        </div>
-                        <div class="form-group">
-                            <label title="Matches audience level and explanation type">Appropriateness (1-5):</label>
-                            <input type="number" min="1" max="5" class="score-input"
-                                   name="appropriateness" data-case="{{ result.case_id }}"
-                                   title="Matches audience level and explanation type">
-                        </div>
-                    </div>
-
-                    <div class="form-group">
-                        <label>Strengths (one per line):</label>
-                        <textarea name="strengths" data-case="{{ result.case_id }}"
-                                  placeholder="Clear technical explanations&#10;Good use of examples&#10;Appropriate for audience level"></textarea>
-                    </div>
-
-                    <div class="form-group">
-                        <label>Weaknesses (one per line):</label>
-                        <textarea name="weaknesses" data-case="{{ result.case_id }}"
-                                  placeholder="Missing key insights&#10;Too verbose for the audience&#10;Unclear instruction explanations"></textarea>
-                    </div>
-
-                    <div class="form-group">
-                        <label>Suggestions (one per line):</label>
-                        <textarea name="suggestions" data-case="{{ result.case_id }}"
-                                  placeholder="Add more WHY explanations&#10;Include performance implications&#10;Simplify technical terms"></textarea>
-                    </div>
-
-                    <div class="form-group">
-                        <label>Overall Comments:</label>
-                        <textarea name="overall_comments" data-case="{{ result.case_id }}"
-                                  placeholder="General feedback"></textarea>
-                    </div>
-
-                    <button class="save-btn" onclick="saveReview('{{ result.case_id }}')">
-                        Save Review for {{ result.case_id }}
-                    </button>
-                    <span class="message" id="message-{{ result.case_id }}"></span>
-                </div>
-            </div>
-        </div>
-        {% endif %}
-        {% endfor %}
-        {% endif %}
-    </div>
-
-    <!-- Include marked library from CDN -->
-    <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
-
-    <script>
-        // Configure marked to match Compiler Explorer
-        marked.setOptions({
-            gfm: true,
-            breaks: true
-        });
-
-        // Global state for existing reviews
-        let existingReviews = {};
-        let reviewProgress = { total: 0, reviewed: 0 };
-
-        // Initialize page on load
-        document.addEventListener('DOMContentLoaded', function() {
-            // Load saved reviewer name from localStorage
-            const savedReviewer = localStorage.getItem('reviewerName');
-            if (savedReviewer) {
-                document.getElementById('reviewer-name').value = savedReviewer;
-            }
-
-            // Save reviewer name to localStorage when changed
-            document.getElementById('reviewer-name').addEventListener('input', function() {
-                localStorage.setItem('reviewerName', this.value);
-            });
-
-            // Load existing reviews for this prompt version
-            loadExistingReviews();
-
-            // Render all markdown responses
-            {% for result in data.results %}
-            {% if result.success %}
-            const responseEl_{{ result.case_id | replace("-", "_") }} = document.getElementById('response-{{ result.case_id }}');
-            if (responseEl_{{ result.case_id | replace("-", "_") }}) {
-                const markdown = {{ result.response | tojson }};
-                responseEl_{{ result.case_id | replace("-", "_") }}.innerHTML = marked.parse(markdown);
-            }
-            {% endif %}
-            {% endfor %}
-        });
-
-        async function loadExistingReviews() {
-            try {
-                const promptVersion = '{{ data.summary.prompt_version if data.summary else "unknown" }}';
-                const response = await fetch(`/api/reviews/prompt/${encodeURIComponent(promptVersion)}`);
-                const data = await response.json();
-
-                existingReviews = data.reviews_by_case || {};
-
-                // Update UI for each case
-                const caseCards = document.querySelectorAll('.case-card');
-                reviewProgress.total = caseCards.length;
-                reviewProgress.reviewed = 0;
-
-                caseCards.forEach(card => {
-                    const caseId = extractCaseIdFromCard(card);
-                    if (caseId && existingReviews[caseId]) {
-                        updateCaseReviewStatus(caseId, true, existingReviews[caseId]);
-                        reviewProgress.reviewed++;
-                    } else if (caseId) {
-                        updateCaseReviewStatus(caseId, false);
-                    }
-                });
-
-                updateProgressIndicator();
-
-            } catch (error) {
-                console.error('Failed to load existing reviews:', error);
-            }
-        }
-
-        function extractCaseIdFromCard(card) {
-            const statusElement = card.querySelector('[id^="status-"]');
-            if (statusElement) {
-                return statusElement.id.replace('status-', '');
-            }
-            return null;
-        }
-
-        function updateCaseReviewStatus(caseId, isReviewed, reviewData = null) {
-            const card = document.querySelector(`#status-${caseId}`).closest('.case-card');
-            const statusElement = document.getElementById(`status-${caseId}`);
-            const saveButton = document.querySelector(`button[onclick="saveReview('${caseId}')"]`);
-
-            if (isReviewed) {
-                // Mark as reviewed
-                card.classList.remove('not-reviewed');
-                card.classList.add('reviewed');
-                statusElement.classList.remove('not-reviewed');
-                statusElement.classList.add('reviewed');
-                statusElement.innerHTML = '<span>✅</span> Reviewed';
-
-                if (saveButton) {
-                    saveButton.textContent = `Update Review for ${caseId}`;
-                }
-
-                // Pre-populate form if review data available
-                if (reviewData) {
-                    populateReviewForm(caseId, reviewData);
-                }
-            } else {
-                // Mark as not reviewed
-                card.classList.remove('reviewed');
-                card.classList.add('not-reviewed');
-                statusElement.classList.remove('reviewed');
-                statusElement.classList.add('not-reviewed');
-                statusElement.innerHTML = '<span>⚪</span> Not Reviewed';
-
-                if (saveButton) {
-                    saveButton.textContent = `Save Review for ${caseId}`;
-                }
-            }
-        }
-
-        function populateReviewForm(caseId, reviewData) {
-            // Populate numeric scores
-            const fields = ['accuracy', 'relevance', 'conciseness', 'insight', 'appropriateness'];
-            fields.forEach(field => {
-                const input = document.querySelector(`input[name="${field}"][data-case="${caseId}"]`);
-                if (input && reviewData[field]) {
-                    input.value = reviewData[field];
-                }
-            });
-
-            // Populate text areas
-            const textFields = ['strengths', 'weaknesses', 'suggestions', 'overall_comments'];
-            textFields.forEach(field => {
-                const textarea = document.querySelector(`textarea[name="${field}"][data-case="${caseId}"]`);
-                if (textarea && reviewData[field]) {
-                    if (Array.isArray(reviewData[field])) {
-                        textarea.value = reviewData[field].join('\n');
-                    } else {
-                        textarea.value = reviewData[field];
-                    }
-                }
-            });
-        }
-
-        function updateProgressIndicator() {
-            const progressContainer = document.getElementById('review-progress');
-            const progressText = document.getElementById('progress-text');
-            const progressFill = document.getElementById('progress-fill');
-
-            if (reviewProgress.total > 0) {
-                const percentage = Math.round((reviewProgress.reviewed / reviewProgress.total) * 100);
-                progressText.textContent = `${reviewProgress.reviewed}/${reviewProgress.total} cases reviewed (${percentage}%)`;
-                progressFill.style.width = `${percentage}%`;
-                progressContainer.style.display = 'block';
-            }
-        }
-
-        async function saveReview(caseId) {
-            const reviewerName = document.getElementById('reviewer-name').value.trim();
-            if (!reviewerName) {
-                alert('Please enter your name/ID before saving reviews.');
-                return;
-            }
-
-            const button = event.target;
-            const messageEl = document.getElementById('message-' + caseId);
-
-            button.disabled = true;
-            button.textContent = 'Saving...';
-            messageEl.textContent = '';
-
-            try {
-                const formData = {
-                    case_id: caseId,
-                    prompt_version: '{{ data.summary.prompt_version if data.summary else "unknown" }}',
-                    reviewer: reviewerName,
-                    accuracy: document.querySelector(
-                        `input[name="accuracy"][data-case="${caseId}"]`).value,
-                    relevance: document.querySelector(
-                        `input[name="relevance"][data-case="${caseId}"]`).value,
-                    conciseness: document.querySelector(
-                        `input[name="conciseness"][data-case="${caseId}"]`).value,
-                    insight: document.querySelector(
-                        `input[name="insight"][data-case="${caseId}"]`).value,
-                    appropriateness: document.querySelector(
-                        `input[name="appropriateness"][data-case="${caseId}"]`).value,
-                    strengths: document.querySelector(
-                        `textarea[name="strengths"][data-case="${caseId}"]`).value,
-                    weaknesses: document.querySelector(
-                        `textarea[name="weaknesses"][data-case="${caseId}"]`).value,
-                    suggestions: document.querySelector(
-                        `textarea[name="suggestions"][data-case="${caseId}"]`).value,
-                    overall_comments: document.querySelector(
-                        `textarea[name="overall_comments"][data-case="${caseId}"]`).value
-                };
-
-                if (!formData.accuracy || !formData.relevance ||
-                    !formData.conciseness || !formData.insight || !formData.appropriateness) {
-                    throw new Error('Please fill in all rating scores (1-5)');
-                }
-
-                const response = await fetch('/api/review', {
-                    method: 'POST',
-                    headers: { 'Content-Type': 'application/json' },
-                    body: JSON.stringify(formData)
-                });
-
-                const result = await response.json();
-
-                if (result.success) {
-                    const wasUpdate = existingReviews[caseId] !== undefined;
-                    messageEl.textContent = wasUpdate ? '✅ Updated!' : '✅ Saved!';
-                    messageEl.className = 'success-msg';
-
-                    // Update the global state
-                    existingReviews[caseId] = formData;
-
-                    // Update UI to show reviewed status
-                    if (!wasUpdate) {
-                        reviewProgress.reviewed++;
-                    }
-                    updateCaseReviewStatus(caseId, true, formData);
-                    updateProgressIndicator();
-
-                    // Don't disable inputs for updates, allow further editing
-                } else {
-                    throw new Error(result.error || 'Failed to save review');
-                }
-
-            } catch (error) {
-                messageEl.textContent = '❌ Error: ' + error.message;
-                messageEl.className = 'error-msg';
-            } finally {
-                button.disabled = false;
-                // Update button text based on current review status
-                const isReviewed = existingReviews[caseId] !== undefined;
-                button.textContent = isReviewed ? `Update Review for ${caseId}` : `Save Review for ${caseId}`;
-            }
-        }
-    </script>
-</body>
-</html>
diff --git a/prompt_testing/test_cli_utils.py b/prompt_testing/test_cli_utils.py
deleted file mode 100644
index 67aa6ee..0000000
--- a/prompt_testing/test_cli_utils.py
+++ /dev/null
@@ -1,101 +0,0 @@
-"""Tests for CLI utility functions."""
-
-import json
-
-from prompt_testing.cli import _filter_compilers, _generate_compiler_mapping
-
-
-class MockCompiler:
-    """Simple mock compiler object."""
-
-    def __init__(self, id, name, instruction_set=None):
-        self.id = id
-        self.name = name
-        self.instruction_set = instruction_set
-
-
-def test_filter_compilers_by_instruction_set(capsys):
-    """Test filtering compilers by instruction set."""
-    # Mock compiler objects
-    compilers = [
-        MockCompiler("gcc1", "GCC 1", "x86_64"),
-        MockCompiler("gcc2", "GCC 2", "arm64"),
-        MockCompiler("gcc3", "GCC 3", "x86_64"),
-    ]
-
-    # Filter
-    filtered = _filter_compilers(compilers, "x86_64", None)
-
-    # Should only have x86_64 compilers
-    assert len(filtered) == 2
-    assert all(c.instruction_set == "x86_64" for c in filtered)
-
-    # Check printed output
-    captured = capsys.readouterr()
-    assert "Filtered to 2 compilers" in captured.out
-
-
-def test_filter_compilers_by_search(capsys):
-    """Test filtering compilers by search term."""
-    compilers = [
-        MockCompiler("gcc1210", "x86-64 gcc 12.1"),
-        MockCompiler("clang1500", "x86-64 clang 15.0.0"),
-        MockCompiler("gcc1310", "x86-64 gcc 13.1"),
-    ]
-
-    # Search for "gcc"
-    filtered = _filter_compilers(compilers, None, "gcc")
-
-    assert len(filtered) == 2
-    assert all("gcc" in c.name.lower() for c in filtered)
-
-    # Search by ID
-    filtered = _filter_compilers(compilers, None, "clang1500")
-
-    assert len(filtered) == 1
-    assert filtered[0].id == "clang1500"
-
-
-def test_filter_compilers_combined(capsys):
-    """Test filtering with multiple criteria."""
-    compilers = [
-        MockCompiler("gcc1", "x86-64 gcc 12.1", "x86_64"),
-        MockCompiler("gcc2", "arm gcc 12.1", "arm64"),
-        MockCompiler("clang1", "x86-64 clang 15.0", "x86_64"),
-    ]
-
-    # Filter by instruction set AND search
-    filtered = _filter_compilers(compilers, "x86_64", "gcc")
-
-    assert len(filtered) == 1
-    assert filtered[0].id == "gcc1"
-
-
-def test_generate_compiler_mapping(tmp_path):
-    """Test generating compiler name to ID mapping."""
-    compilers = [
-        MockCompiler("gcc1210", "x86-64 gcc 12.1"),
-        MockCompiler("gcc1310", "x86-64 gcc 13.1"),
-        MockCompiler("clang1500", "x86-64 clang 15.0.0"),
-    ]
-
-    output_file = tmp_path / "mapping.json"
-
-    # Generate mapping
-    _generate_compiler_mapping(compilers, output_file)
-
-    # Load and verify
-    with output_file.open() as f:
-        mapping = json.load(f)
-
-    # Should have full names
-    assert mapping["x86-64 gcc 12.1"] == "gcc1210"
-    assert mapping["x86-64 gcc 13.1"] == "gcc1310"
-    assert mapping["x86-64 clang 15.0.0"] == "clang1500"
-
-    # Should have short names for gcc
-    assert mapping["gcc 12.1"] == "gcc1210"
-    assert mapping["gcc 13.1"] == "gcc1310"
-
-    # Should not have short name for clang (not implemented)
-    assert "clang 15.0.0" not in mapping
diff --git a/prompt_testing/test_scorer.py b/prompt_testing/test_scorer.py
deleted file mode 100644
index 72d4add..0000000
--- a/prompt_testing/test_scorer.py
+++ /dev/null
@@ -1,102 +0,0 @@
-"""Tests for the scorer module."""
-
-import tempfile
-from pathlib import Path
-
-import pytest
-from ruamel.yaml import YAMLError
-
-from prompt_testing.evaluation.scorer import load_all_test_cases, load_test_case
-
-
-def test_load_test_case():
-    """Test loading a specific test case from a YAML file."""
-    # Create a temporary test file
-    with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
-        f.write("""
-description: "Test cases for testing"
-cases:
-  - id: test_case_1
-    category: basic
-    description: "First test case"
-    input:
-      language: C++
-      code: "int main() {}"
-  - id: test_case_2
-    category: advanced
-    description: "Second test case"
-    input:
-      language: Python
-      code: "def main(): pass"
-""")
-        temp_path = f.name
-
-    try:
-        # Test loading existing case
-        case = load_test_case(temp_path, "test_case_1")
-        assert case["id"] == "test_case_1"
-        assert case["category"] == "basic"
-        assert case["input"]["language"] == "C++"
-
-        # Test loading non-existent case
-        with pytest.raises(ValueError, match="Test case unknown_case not found"):
-            load_test_case(temp_path, "unknown_case")
-    finally:
-        Path(temp_path).unlink()
-
-
-def test_load_all_test_cases():
-    """Test loading all test cases from a directory."""
-    with tempfile.TemporaryDirectory() as temp_dir:
-        temp_path = Path(temp_dir)
-
-        # Create test files
-        file1 = temp_path / "test1.yaml"
-        file1.write_text("""
-description: "First test file"
-cases:
-  - id: case1
-    category: basic
-  - id: case2
-    category: advanced
-""")
-
-        file2 = temp_path / "test2.yaml"
-        file2.write_text("""
-description: "Second test file"
-cases:
-  - id: case3
-    category: expert
-""")
-
-        # Test loading all cases
-        cases = load_all_test_cases(temp_dir)
-        assert len(cases) == 3
-        assert {case["id"] for case in cases} == {"case1", "case2", "case3"}
-        assert {case["category"] for case in cases} == {"basic", "advanced", "expert"}
-
-
-def test_load_test_case_with_missing_file():
-    """Test loading from non-existent file."""
-    with pytest.raises(FileNotFoundError):
-        load_test_case("/non/existent/file.yaml", "some_case")
-
-
-def test_load_all_test_cases_empty_dir():
-    """Test loading from empty directory."""
-    with tempfile.TemporaryDirectory() as temp_dir:
-        cases = load_all_test_cases(temp_dir)
-        assert cases == []
-
-
-def test_load_test_case_malformed_yaml():
-    """Test loading malformed YAML file."""
-    with tempfile.NamedTemporaryFile(mode="w", suffix=".yaml", delete=False) as f:
-        f.write("{ invalid yaml content [")
-        temp_path = f.name
-
-    try:
-        with pytest.raises(YAMLError):  # YAML parsing error
-            load_test_case(temp_path, "test_case")
-    finally:
-        Path(temp_path).unlink()
diff --git a/prompt_testing/web_review.py b/prompt_testing/web_review.py
deleted file mode 100644
index e29add0..0000000
--- a/prompt_testing/web_review.py
+++ /dev/null
@@ -1,269 +0,0 @@
-"""
-Flask web interface for interactive prompt review.
-"""
-
-import json
-import webbrowser
-from datetime import datetime
-from pathlib import Path
-from threading import Timer
-
-from flask import Flask, jsonify, render_template, request
-
-from prompt_testing.evaluation.reviewer import HumanReview, ReviewManager
-from prompt_testing.evaluation.scorer import load_all_test_cases
-
-
-class ReviewWebServer:
-    """Web server for interactive prompt review."""
-
-    def __init__(self, project_root: str, port: int = 5000):
-        self.project_root = Path(project_root)
-        self.results_dir = self.project_root / "prompt_testing" / "results"
-        self.test_cases_dir = self.project_root / "prompt_testing" / "test_cases"
-        self.review_manager = ReviewManager(str(self.results_dir))
-        self.port = port
-
-        # Load test cases for enriching results with original data
-        self.test_cases = {}
-        try:
-            all_cases = load_all_test_cases(str(self.test_cases_dir))
-            for case in all_cases:
-                self.test_cases[case["id"]] = case
-        except Exception:
-            # If test cases can't be loaded, continue without enrichment
-            pass
-
-        template_dir = Path(__file__).parent / "templates"
-        self.app = Flask(__name__, template_folder=str(template_dir))
-        self.app.json.compact = False  # Pretty print JSON responses
-
-        self._setup_routes()
-
-    def _enrich_result_with_test_case(self, result: dict) -> dict:
-        """Enrich a result with test case information."""
-        case_id = result.get("case_id")
-        if case_id and case_id in self.test_cases:
-            test_case = self.test_cases[case_id]
-            input_data = test_case.get("input", {})
-
-            # Add test case metadata to the result
-            result["test_case"] = {
-                "language": input_data.get("language", "Unknown"),
-                "compiler": input_data.get("compiler", "Unknown"),
-                "compilation_options": input_data.get("compilationOptions", []),
-                "instruction_set": input_data.get("instructionSet", "Unknown"),
-                "source_code": input_data.get("code", ""),
-                "assembly": input_data.get("asm", []),
-                "description": test_case.get("description", ""),
-                "category": test_case.get("category", ""),
-                "quality": test_case.get("quality", ""),
-                "audience": test_case.get("audience", "beginner"),
-                "explanation_type": test_case.get("explanation_type", "assembly"),
-            }
-        else:
-            # Fallback if test case not found
-            result["test_case"] = {
-                "language": "Unknown",
-                "compiler": "Unknown",
-                "compilation_options": [],
-                "instruction_set": "Unknown",
-                "source_code": "",
-                "assembly": [],
-                "description": f"Test case {case_id}",
-                "category": "",
-                "quality": "",
-                "audience": "beginner",
-                "explanation_type": "assembly",
-            }
-        return result
-
-    def _get_result_description(self, summary: dict, filename: str) -> str:
-        """Generate a descriptive label for a result file."""
-        prompt_version = summary.get("prompt_version", "unknown")
-        total_cases = summary.get("total_cases", 0)
-
-        # Better description based on prompt version
-        if prompt_version == "current":
-            prompt_desc = "Current Production Prompt"
-        elif "baseline" in prompt_version:
-            prompt_desc = f"Baseline Prompt ({prompt_version})"
-        elif "improved" in prompt_version:
-            prompt_desc = f"Improved Prompt ({prompt_version})"
-        else:
-            prompt_desc = prompt_version.replace("_", " ").title()
-
-        # Add context from filename
-        if "comparison" in filename:
-            prompt_desc = f"Comparison: {prompt_desc}"
-
-        return f"{prompt_desc} - {total_cases} cases"
-
-    def _setup_routes(self):
-        """Set up Flask routes."""
-
-        @self.app.route("/")
-        def index():
-            """Main review interface."""
-            return render_template("index.html")
-
-        @self.app.route("/api/results")
-        def list_results():
-            """List available result files."""
-            if not self.results_dir.exists():
-                return jsonify({"results": []})
-
-            results = []
-            for result_file in sorted(self.results_dir.glob("*.json"), reverse=True):
-                if (
-                    result_file.name.startswith("analysis_")
-                    or result_file.name.startswith("human_")
-                    or result_file.name.startswith("comparison_")
-                ):
-                    continue
-
-                try:
-                    with result_file.open() as f:
-                        data = json.load(f)
-
-                    summary = data.get("summary", {})
-
-                    # Format timestamp for display
-                    timestamp = summary.get("timestamp", "")
-                    if timestamp:
-                        try:
-                            from datetime import datetime
-
-                            dt = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
-                            display_timestamp = dt.strftime("%m/%d/%Y %H:%M")
-                        except Exception:
-                            display_timestamp = timestamp[:10]  # Just date part
-                    else:
-                        display_timestamp = "Unknown"
-
-                    results.append(
-                        {
-                            "file": result_file.name,
-                            "prompt_version": summary.get("prompt_version", "unknown"),
-                            "description": self._get_result_description(summary, result_file.name),
-                            "timestamp": display_timestamp,
-                            "success_rate": summary.get("success_rate", 0),
-                            "total_cases": summary.get("total_cases", 0),
-                            "average_score": summary.get("average_metrics", {}).get("overall_score", 0),
-                        }
-                    )
-                except Exception:
-                    continue
-
-            return jsonify({"results": results})
-
-        @self.app.route("/api/results/<filename>")
-        def get_result_details(filename):
-            """Get detailed results for a specific file."""
-            result_file = self.results_dir / filename
-            if not result_file.exists():
-                return jsonify({"error": "Result file not found"}), 404
-
-            try:
-                with result_file.open() as f:
-                    data = json.load(f)
-                return jsonify(data)
-            except Exception as e:
-                return jsonify({"error": str(e)}), 500
-
-        @self.app.route("/review/<filename>")
-        def review_results(filename):
-            """Review interface for a specific result file."""
-            result_file = self.results_dir / filename
-            if not result_file.exists():
-                return "Result file not found", 404
-
-            try:
-                with result_file.open() as f:
-                    data = json.load(f)
-
-                # Enrich results with test case information
-                if "results" in data:
-                    for result in data["results"]:
-                        if result.get("success"):
-                            self._enrich_result_with_test_case(result)
-
-                return render_template("review.html", filename=filename, data=data)
-            except Exception as e:
-                return f"Error loading results: {e}", 500
-
-        @self.app.route("/api/review", methods=["POST"])
-        def save_review():
-            """Save a human review."""
-            try:
-                review_data = request.json
-
-                # Create HumanReview object
-                review = HumanReview(
-                    case_id=review_data["case_id"],
-                    prompt_version=review_data["prompt_version"],
-                    reviewer=review_data["reviewer"],
-                    timestamp=datetime.now().isoformat(),
-                    accuracy=int(review_data["accuracy"]),
-                    relevance=int(review_data["relevance"]),
-                    conciseness=int(review_data["conciseness"]),
-                    insight=int(review_data["insight"]),
-                    appropriateness=int(review_data["appropriateness"]),
-                    strengths=[s.strip() for s in review_data.get("strengths", "").split("\n") if s.strip()],
-                    weaknesses=[w.strip() for w in review_data.get("weaknesses", "").split("\n") if w.strip()],
-                    suggestions=[s.strip() for s in review_data.get("suggestions", "").split("\n") if s.strip()],
-                    overall_comments=review_data.get("overall_comments", ""),
-                    compared_to=review_data.get("compared_to"),
-                    preference=review_data.get("preference"),
-                    preference_reason=review_data.get("preference_reason"),
-                )
-
-                # Save review
-                self.review_manager.save_review(review)
-
-                return jsonify({"success": True, "message": "Review saved successfully"})
-
-            except Exception as e:
-                return jsonify({"success": False, "error": str(e)}), 500
-
-        @self.app.route("/api/reviews/<case_id>")
-        def get_reviews(case_id):
-            """Get existing reviews for a case."""
-            reviews = self.review_manager.load_reviews(case_id=case_id)
-            return jsonify({"reviews": [review.__dict__ for review in reviews]})
-
-        @self.app.route("/api/reviews/prompt/<prompt_version>")
-        def get_reviews_for_prompt(prompt_version):
-            """Get all existing reviews for a specific prompt version."""
-            reviews = self.review_manager.load_reviews(prompt_version=prompt_version)
-            # Group reviews by case_id for easier frontend consumption
-            #
-            # Design Note: We use direct assignment (latest review wins) rather than arrays
-            # because this UI is designed for single-reviewer prompt iteration workflows.
-            # The underlying JSONL storage preserves all reviews if multi-reviewer support
-            # is needed later. For current use case (prompt optimization), we want:
-            # - Simple binary state (reviewed/not reviewed)
-            # - Clear completion tracking without reviewer consensus complexity
-            # - UI optimized for iteration speed rather than comprehensive evaluation
-            # If multiple reviewers become common, this can be evolved to support arrays.
-            reviews_by_case = {}
-            for review in reviews:
-                reviews_by_case[review.case_id] = review.__dict__  # Latest review per case
-            return jsonify({"reviews_by_case": reviews_by_case, "total_reviews": len(reviews)})
-
-    def start(self, open_browser: bool = True):
-        """Start the web server."""
-        if open_browser:
-            # Open browser after a short delay
-            Timer(1.0, lambda: webbrowser.open(f"http://localhost:{self.port}")).start()
-
-        print(f"🚀 Review interface starting at http://localhost:{self.port}")
-        print("Press Ctrl+C to stop the server")
-
-        self.app.run(host="127.0.0.1", port=self.port, debug=False)
-
-
-def start_review_server(project_root: str, port: int = 5000, open_browser: bool = True):
-    """Start the review web server."""
-    server = ReviewWebServer(project_root, port)
-    server.start(open_browser)
diff --git a/pyproject.toml b/pyproject.toml
index b56b6e5..cdc0766 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -10,9 +10,7 @@ dependencies = [
     "boto3>=1.40.30",
     "click>=8.2.1",
     "fastapi>=0.116.1",
-    "flask>=3.1.2",
     "humanfriendly>=10.0",
-    "jinja2>=3.1.6",
     "mangum>=0.19.0",
     "pydantic-settings>=2.10.1",
     "python-dotenv>=1.1.1",
diff --git a/uv.lock b/uv.lock
index dd58793..4d938ba 100644
--- a/uv.lock
+++ b/uv.lock
@@ -1,5 +1,5 @@
 version = 1
-revision = 2
+revision = 3
 requires-python = ">=3.13"
 
 [[package]]
@@ -118,15 +118,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/43/4f/3161e284408d39b6bf444e209b52c880f0647274fec9da05541567f3bf99/aws_embedded_metrics-3.3.0-py3-none-any.whl", hash = "sha256:03901a28786a93e718ddb7342a917c3a3c8204c31195edb36151d8b7eef361b3", size = 40542, upload-time = "2025-02-06T17:22:00.133Z" },
 ]
 
-[[package]]
-name = "blinker"
-version = "1.9.0"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/21/28/9b3f50ce0e048515135495f198351908d99540d69bfdc8c1d15b73dc55ce/blinker-1.9.0.tar.gz", hash = "sha256:b4ce2265a7abece45e7cc896e98dbebe6cead56bcf805a3d23136d145f5445bf", size = 22460, upload-time = "2024-11-08T17:25:47.436Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/10/cb/f2ad4230dc2eb1a74edf38f1a38b9b52277f75bef262d8908e60d957e13c/blinker-1.9.0-py3-none-any.whl", hash = "sha256:ba0efaa9080b619ff2f3459d1d500c57bddea4a6b424b60a91141db6fd2f08bc", size = 8458, upload-time = "2024-11-08T17:25:46.184Z" },
-]
-
 [[package]]
 name = "boto3"
 version = "1.40.30"
@@ -275,9 +266,7 @@ dependencies = [
     { name = "boto3" },
     { name = "click" },
     { name = "fastapi" },
-    { name = "flask" },
     { name = "humanfriendly" },
-    { name = "jinja2" },
     { name = "mangum" },
     { name = "pydantic-settings" },
     { name = "python-dotenv" },
@@ -301,9 +290,7 @@ requires-dist = [
     { name = "boto3", specifier = ">=1.40.30" },
     { name = "click", specifier = ">=8.2.1" },
     { name = "fastapi", specifier = ">=0.116.1" },
-    { name = "flask", specifier = ">=3.1.2" },
     { name = "humanfriendly", specifier = ">=10.0" },
-    { name = "jinja2", specifier = ">=3.1.6" },
     { name = "mangum", specifier = ">=0.19.0" },
     { name = "pydantic-settings", specifier = ">=2.10.1" },
     { name = "python-dotenv", specifier = ">=1.1.1" },
@@ -391,23 +378,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/42/14/42b2651a2f46b022ccd948bca9f2d5af0fd8929c4eec235b8d6d844fbe67/filelock-3.19.1-py3-none-any.whl", hash = "sha256:d38e30481def20772f5baf097c122c3babc4fcdb7e14e57049eb9d88c6dc017d", size = 15988, upload-time = "2025-08-14T16:56:01.633Z" },
 ]
 
-[[package]]
-name = "flask"
-version = "3.1.2"
-source = { registry = "https://pypi.org/simple" }
-dependencies = [
-    { name = "blinker" },
-    { name = "click" },
-    { name = "itsdangerous" },
-    { name = "jinja2" },
-    { name = "markupsafe" },
-    { name = "werkzeug" },
-]
-sdist = { url = "https://files.pythonhosted.org/packages/dc/6d/cfe3c0fcc5e477df242b98bfe186a4c34357b4847e87ecaef04507332dab/flask-3.1.2.tar.gz", hash = "sha256:bf656c15c80190ed628ad08cdfd3aaa35beb087855e2f494910aa3774cc4fd87", size = 720160, upload-time = "2025-08-19T21:03:21.205Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/ec/f9/7f9263c5695f4bd0023734af91bedb2ff8209e8de6ead162f35d8dc762fd/flask-3.1.2-py3-none-any.whl", hash = "sha256:ca1d8112ec8a6158cc29ea4858963350011b5c846a414cdb7a954aa9e967d03c", size = 103308, upload-time = "2025-08-19T21:03:19.499Z" },
-]
-
 [[package]]
 name = "frozenlist"
 version = "1.7.0"
@@ -542,15 +512,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/2c/e1/e6716421ea10d38022b952c159d5161ca1193197fb744506875fbb87ea7b/iniconfig-2.1.0-py3-none-any.whl", hash = "sha256:9deba5723312380e77435581c6bf4935c94cbfab9b1ed33ef8d238ea168eb760", size = 6050, upload-time = "2025-03-19T20:10:01.071Z" },
 ]
 
-[[package]]
-name = "itsdangerous"
-version = "2.2.0"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/9c/cb/8ac0172223afbccb63986cc25049b154ecfb5e85932587206f42317be31d/itsdangerous-2.2.0.tar.gz", hash = "sha256:e0050c0b7da1eea53ffaf149c0cfbb5c6e2e2b69c4bef22c81fa6eb73e5f6173", size = 54410, upload-time = "2024-04-16T21:28:15.614Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/04/96/92447566d16df59b2a776c0fb82dbc4d9e07cd95062562af01e408583fc4/itsdangerous-2.2.0-py3-none-any.whl", hash = "sha256:c6242fc49e35958c8b15141343aa660db5fc54d4f13a1db01a3f5891b98700ef", size = 16234, upload-time = "2024-04-16T21:28:14.499Z" },
-]
-
 [[package]]
 name = "jinja2"
 version = "3.1.6"
@@ -1326,18 +1287,6 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/fa/a8/5b41e0da817d64113292ab1f8247140aac61cbf6cfd085d6a0fa77f4984f/websockets-15.0.1-py3-none-any.whl", hash = "sha256:f7a866fbc1e97b5c617ee4116daaa09b722101d4a3c170c787450ba409f9736f", size = 169743, upload-time = "2025-03-05T20:03:39.41Z" },
 ]
 
-[[package]]
-name = "werkzeug"
-version = "3.1.3"
-source = { registry = "https://pypi.org/simple" }
-dependencies = [
-    { name = "markupsafe" },
-]
-sdist = { url = "https://files.pythonhosted.org/packages/9f/69/83029f1f6300c5fb2471d621ab06f6ec6b3324685a2ce0f9777fd4a8b71e/werkzeug-3.1.3.tar.gz", hash = "sha256:60723ce945c19328679790e3282cc758aa4a6040e4bb330f53d30fa546d44746", size = 806925, upload-time = "2024-11-08T15:52:18.093Z" }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/52/24/ab44c871b0f07f491e5d2ad12c9bd7358e527510618cb1b803a88e986db1/werkzeug-3.1.3-py3-none-any.whl", hash = "sha256:54b78bf3716d19a65be4fceccc0d1d7b89e608834989dfae50ea87564639213e", size = 224498, upload-time = "2024-11-08T15:52:16.132Z" },
-]
-
 [[package]]
 name = "yarl"
 version = "1.20.1"