auto-optimize

AutoResearch for Performance Engineering

Measure first. Reason deep. Reflect. Repeat.

Andrej Karpathy introduced the idea of autoresearch — closing the loop between hypothesis, experiment, and reflection so that an AI agent can drive an entire research cycle autonomously. auto-optimize applies that idea to performance engineering.

You define a numeric goal and a success threshold. The plugin builds regression and benchmark infrastructure, locks a baseline, then runs an autonomous loop: profile → reason → plan → apply → test → measure → reflect → repeat. Every iteration is a git commit. Every decision is reasoned, recorded, and fed back into the next cycle.

Installation

claude plugin marketplace add bluuewhale/auto-optimize
claude plugin install auto-optimize@auto-optimize

Quick Start

/auto-optimize The API is slow. I want to make it faster.

auto-optimize will ask a few clarifying questions — metric, scope, success target, and test commands — then take it from there. If benchmarks or regression tests are missing, it writes them before starting the loop.

Real-World Result: 27% Faster Hash Table

Full write-up: HashSmith Part 3 — I Automated My Way to a 27% Faster Hash Table

HashSmith is an open-source high-performance hash table for the JVM — a SwissTable-style map built around SWAR probing and 8-byte control word groups. After two rounds of manual optimization, the author handed the profiler to auto-optimize.

One prompt. ~3 hours. No manual intervention.

/auto-optimize I want to optimize the get/put performance of the SwissMap implementation.


Experiments run	5
Optimizations landed	3
Dropped	2
Improvement vs baseline	13–32% across all 8 benchmark scenarios

What the agent found

The agent ran 5 experiments autonomously. Three compounding wins, in order:

Tombstone guard — the probe loop was carrying tombstone-handling logic on a path where tombstones essentially never exist in production. Splitting into two specialized loop bodies eliminated the dead weight. Put path: -19% to -45%.
ILP hoisting on the read path — emptyMask was being computed after the key-equality loop, creating a serial dependency. Moving it adjacent to eqMask let the CPU's out-of-order engine pipeline both SWAR operations in the same clock cycle. Get path: -11% to -36%.
A third, smaller improvement compounded on top of both.

None of these required a single line of code written by the author. The structured reasoning pipeline (Step-Back → CoT → Self-Consistency → Pre-mortem) found the tombstone fast path by asking what is this loop doing that it doesn't need to do? — a question that wasn't visible in the disassembly alone.

The Problem

Most optimization attempts fail silently:

You change code, feel like it's faster, ship it — but never measured before or after
You write a quick benchmark once, optimize for it, then lose the script
You try five approaches, forget what you tried, and repeat the same dead-ends

auto-optimize enforces the discipline you know you should have but don't.

How It Works

Phase	What Happens	Output
0. Gather	Collects goal, scope, metric direction, and numeric success criteria	`experiment-plan.md`
1. Infra	Builds Regression Test and Benchmark Test scripts if missing (parallel sub-agents)	`tests/` + `bench/`
1.5 Baseline	Locks noise-floor-validated baseline measurement and environment snapshot	`baseline/`
2. Loop	Profile → Disassemble → Reason (Opus) → Apply → Test → Benchmark → Reflect	`iterations/` + `leaderboard.md`
3. Report	Summarizes all iterations, best config, and recommended next steps	`final-report.md`

Every iteration is a git commit — including reverts. The full experiment history is always recoverable.

The Intelligence Layer

Most AI coding tools apply changes and hope for the best. auto-optimize's inner loop is built differently — each iteration runs a structured reasoning pipeline powered by Claude Opus before a single line of code is touched.

Multi-technique Reasoning per Iteration

Every iteration delegates planning to a dedicated Opus sub-agent that applies four reasoning techniques in sequence:

Technique	What It Does
Step-Back	Abstracts the bottleneck type first (CPU-bound / I/O-bound / Memory / Network) and derives general solution principles before looking at specifics
Chain-of-Thought	Enumerates at least 3 candidate approaches with explicit trade-off analysis, then argues for the chosen strategy with an expected improvement estimate
Self-Consistency	Re-evaluates all candidates independently, as if seeing them for the first time — selects only the strategy that wins both evaluations
Pre-mortem	Assumes the chosen strategy already failed and writes the post-mortem — identifying the most likely failure mode and the early signal that would confirm it

This isn't prompt engineering for show. It's a guard against the most common failure mode of autonomous agents: plausible-sounding decisions that haven't been stress-tested.

Deep Profiling + Disassembly Analysis

Before planning, the executor sub-agent re-profiles the current codebase to detect whether the bottleneck has shifted from the previous iteration. When source-level analysis is insufficient — or when previous iterations show diminishing returns — it drops into disassembly analysis:

Native code (C/C++/Rust/Go): checks for missed SIMD vectorization, failed inlining, redundant memory ops
Python: dis module bytecode analysis to spot inefficient interpreter paths
JVM: javap bytecode inspection

Findings are annotated directly into plan.md with specific instruction or bytecode offsets, giving the planner sub-agent ground-truth evidence rather than guesses.

Self-Evolving via Reflexion

After every iteration — whether it succeeded or was reverted — the executor writes a reflexion.md:

What was surprising about this iteration's outcome
Concrete lessons for the next iteration (not vague impressions)
Promising unexplored directions spotted during execution

The next iteration reads this reflexion before profiling or planning. The agent gets smarter with each cycle, accumulating a growing body of experiment-specific knowledge that no general-purpose AI has access to.

This is the difference between an agent that runs N experiments and an agent that runs N increasingly informed experiments.

Core Principles

No measurement, no optimization. The loop cannot start until both the Regression Test and Benchmark Test commands are confirmed working. Ambiguous success criteria ("as fast as possible") are rejected — a specific number is required. Baseline measurement includes a noise-floor check: if variance across 5 runs exceeds 5%, the noise source must be identified before locking the baseline.

Plan before every touch. No code is changed until a written plan exists in plan.md. The plan is produced by a dedicated Opus planner sub-agent using the full reasoning pipeline above — not inlined into the executor's context.

Sub-agents isolate context. Each iteration runs in a dedicated sub-agent. Raw profiling output, disassembly, diffs, and test logs never pollute the main context. The main loop sees only a structured return value: result, delta, decision, and next hint. No /clear needed.

Failure is data. Every attempt is recorded — including the ones that didn't make the cut. The leaderboard tracks what worked, what didn't, and why. The reflexion from a failed iteration is often more valuable than the result from a successful one.

Example Leaderboard

After a few iterations, experiments/exp-001-api-latency/leaderboard.md looks like:

# Experiment Leaderboard: exp-001-api-latency
Last updated: 2026-04-04

| Rank | Iteration              | Result | Delta  | Strategy                    | Status     |
|------|------------------------|--------|--------|-----------------------------|------------|
| 1    | iter-003-query-cache   | 178ms  | -162ms | Cache repeated DB lookups   | ✅ 🎯      |
| 2    | iter-001-index-add     | 295ms  | -45ms  | Add composite index         | ✅          |
| 3    | baseline               | 340ms  | —      | —                           | Baseline   |
| 4    | iter-002-async-io      | 360ms  | +20ms  | Async conversion (backfired)| ❌          |

Experiment Structure

All experiments are tracked under experiments/ in your project:

experiments/
  exp-001-api-latency/
    experiment-plan.md        # Goal, scope, metric, success criteria
    tests/                    # Regression Test scripts (never modified by optimizer)
    bench/                    # Benchmark Test scripts (never modified by optimizer)
    baseline/
      result.txt              # Noise-floor-validated baseline (median of 5 runs)
      env-snapshot.txt        # OS, runtime, dependency versions at baseline time
    iterations/
      iter-001-index-add/
        profile-snapshot.txt  # Profiler output for this iteration
        disasm-snapshot.txt   # Disassembly output (when triggered)
        plan.md               # Step-Back + CoT + Self-Consistency + Pre-mortem (Opus)
        result.txt            # Benchmark Test output
        summary.md            # Delta vs baseline, strategy summary
        reflexion.md          # What was learned, concrete hints for next iteration
    leaderboard.md            # Ranked results, updated after every iteration
    final-report.md           # Written after loop ends

Requirements

Claude Code with sub-agent support
Git (all iterations are committed automatically)
A project with at least one measurable numeric metric

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.claude-plugin		.claude-plugin
skills/auto-optimize		skills/auto-optimize
README.md		README.md
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

auto-optimize

Installation

Quick Start

Real-World Result: 27% Faster Hash Table

What the agent found

The Problem

How It Works

The Intelligence Layer

Multi-technique Reasoning per Iteration

Deep Profiling + Disassembly Analysis

Self-Evolving via Reflexion

Core Principles

Example Leaderboard

Experiment Structure

Requirements

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

auto-optimize

Installation

Quick Start

Real-World Result: 27% Faster Hash Table

What the agent found

The Problem

How It Works

The Intelligence Layer

Multi-technique Reasoning per Iteration

Deep Profiling + Disassembly Analysis

Self-Evolving via Reflexion

Core Principles

Example Leaderboard

Experiment Structure

Requirements

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages