Auto Skill Improver

Most prompt iteration is theater. This is how you stop lying to yourself about whether anything got better.

Auto Skill Improver is a benchmark-first loop for improving Claude Code, Cowork, and OpenClaw skills through controlled mutation and honest keep/discard decisions — not prompt vibes.

It is built for people who have real skill files or prompt specs and want to know — with evidence — whether a change actually made things better.

The Problem

The failure mode is always the same.

You mutate an instruction. You read the new version. It sounds more sophisticated, so you keep it. Nothing measurably improved.

Auto Skill Improver resists that trap by treating skill iteration as an empirical process, not an editorial one.

The Benchmark-Mutation Loop

The core method — directly inspired by Andrej Karpathy's autoresearch work — applies the same measurement discipline to skill files and prompt specs:

Classify the skill type
Build a benchmark that can actually distinguish success from failure
Establish a baseline before you change anything
Mutate one thing at a time
Keep only what improves the score
Stop when the benchmark saturates or stops teaching you anything

The principle we borrowed from Karpathy is not "let an AI improve itself." It is the more disciplined idea underneath it: use a bounded environment, a real benchmark, a repeatable loop, and make decisions based on evidence — not instinct.

Where It Works Best

Auto Skill Improver produces real signal on skills with:

Ambiguous output contracts — the skill doesn't clearly define what good looks like
Missing fallback behavior — the skill collapses when a dependency is unavailable
Conflicting instruction layers — one rule contradicts another
Dependency or portability problems — the skill breaks outside its original environment
Weak evidence discipline — the skill invents certainty when data is missing
Structural formatting rules — output shape can be checked programmatically

Worked examples

Missing fallback: A skill relies on a reference file for its core quality check. If the file is missing, the skill's main promise collapses. A fallback mutation produces a real, testable improvement.

Conflicting rules: One skill says "close with a statement" while another encourages question endings. Adding an explicit priority rule is a real, measurable change.

Benchmark saturation: A compact reference skill that already lists the necessary commands may saturate the benchmark immediately. The right answer is to stop — and that is still useful information.

Where It Does Not Fit

Be realistic about scope.

Auto Skill Improver is a poor fit when:

The skill is already small, crisp, and adequate
The only thing you care about is subjective taste
You cannot build a meaningful benchmark
The benchmark saturates immediately and tightening does not help
The target is a high-risk orchestration layer without proxy metrics

It can surface missing decision logic, broken dependencies, contradictory instructions, and weak output contracts. It cannot solve taste, deep product judgment, audience resonance, or real-world performance without real-world testing.

What It Finds

Missing decision logic
Broken dependencies
Contradictory instructions
Weak output contracts
Evaluator saturation

What It Cannot Fix

Taste
Deep product judgment
Audience resonance
Real-world performance without real-world testing

How To Use It

Read docs/TRIAGE.md
Pick a target skill
Classify the skill type
Build a small benchmark
Run a baseline — before you change anything
Tighten the benchmark if it saturates immediately
Mutate one thing
Re-run and review the diff
Keep or discard honestly

Honest Limitations

Same-model evaluation can be circular. Formatting quality is hard to measure with binary checks alone. Some skills require proxy benchmarks rather than direct scoring. Portability improvements can conflict with platform-specific optimization.

Read docs/LIMITATIONS.md before treating any score as definitive.

A binary eval can still be useless. The goal is not just yes/no questions — it is yes/no questions tied to task success.

UX and Run Model

The best UX for Auto Skill Improver is a research-run workflow, not a live dashboard.

Users should stay in the loop through decision checkpoints:

baseline complete
benchmark saturated
mutation kept
regression detected
final summary ready

Best default mode:

supervised batch run
baseline + 1-3 mutations + summary + human decision

Best operational trigger:

run on skill changes, model changes, repeated failures, or dependency breakage
do not run blind timer-based optimization on everything

Read:

docs/UX-MODEL.md
docs/RUN-MODES.md

Repo Structure

skills/       — optimizer specs for different execution environments
docs/         — limitations, triage, eval patterns, context intake, and run guidance
harnesses/    — public-safe template harnesses
examples/     — synthetic worked examples

Key Principles

If you mutate before measuring the starting point, you are doing theater, not research.
If you change five things at once, you lose attribution.
If the baseline already scores near-perfectly, your skill may already be good — or your benchmark is weak. Either way, do not claim improvement.
Longer loops are not always better loops.
The point is not to make skills longer. The point is to make them clearer, more robust, and more measurably useful.
If the benchmark says nothing improved, that is still useful information.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
docs		docs
examples		examples
harnesses		harnesses
skills		skills
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Auto Skill Improver

The Problem

The Benchmark-Mutation Loop

Where It Works Best

Worked examples

Where It Does Not Fit

What It Finds

What It Cannot Fix

How To Use It

Honest Limitations

UX and Run Model

Repo Structure

Key Principles

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Auto Skill Improver

The Problem

The Benchmark-Mutation Loop

Where It Works Best

Worked examples

Where It Does Not Fit

What It Finds

What It Cannot Fix

How To Use It

Honest Limitations

UX and Run Model

Repo Structure

Key Principles

License

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages