feat: improve running-tests skill score (64% → 90%) by yogesh-tessl · Pull Request #5269 · stellar/stellar-core

yogesh-tessl · 2026-05-14T03:38:01Z

Truly impressive work: 31 skills mapping every subsystem, from SCP consensus to Soroban smart contracts to ledger and bucket management. Breaking each subsystem down into its own reviewable skill shows real attention to detail, and the existing claude-review.yml with its thorough security model shows you take AI tooling seriously.

I ran your skills through tessl skill review at work and found some targeted improvements for the running-tests skill. Here's the before/after:

Skill	Before	After	Change
running-tests	64%	90%	+26 pts

Changes made

Description rewrite (biggest impact - 33% → 100%):

Expanded from a vague 82-character label to a full description listing concrete capabilities: Catch2 test suite execution, progressive test levels, protocol version matrix, tx-meta baseline checks, and sanitizer builds (ASan/TSan/UBSan)
Added explicit "Use when..." clause with natural trigger terms ("run tests", "verify a change", "check if tests pass", "run unit tests", "run regression tests", "validate code before a PR")
Added domain-specific distinctiveness (stellar-core, Catch2) so the skill won't conflict with other testing-related skills
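The description criteria listed above can be approximated with a small check. This is only a sketch: the function name `check_description`, the `min_length` threshold, and the example description text are hypothetical and written for illustration. It is not how `tessl skill review` computes its score, and the example text is not the description from this PR. Only the trigger phrases are taken from the list above.

```python
import re

# Trigger phrases quoted in the PR description.
TRIGGER_TERMS = [
    "run tests",
    "verify a change",
    "check if tests pass",
    "run unit tests",
    "run regression tests",
    "validate code before a PR",
]


def check_description(description: str, min_length: int = 100) -> dict:
    """Return simple, hypothetical quality signals for a skill description."""
    lowered = description.lower()
    return {
        # Assumed threshold; the PR only says the old label was 82 characters.
        "long_enough": len(description) >= min_length,
        # An explicit "Use when..." clause helps an agent decide when to invoke the skill.
        "has_use_when": re.search(r"\buse when\b", lowered) is not None,
        # Domain terms keep the skill distinct from other testing-related skills.
        "names_domain": "stellar-core" in lowered and "catch2" in lowered,
        # Which of the quoted trigger phrases appear verbatim.
        "trigger_terms": [t for t in TRIGGER_TERMS if t.lower() in lowered],
    }


# Hypothetical description text, written for this example only.
EXAMPLE = (
    "Runs the stellar-core Catch2 test suite at progressive test levels, "
    "across the protocol version matrix, with tx-meta baseline checks and "
    "sanitizer builds (ASan/TSan/UBSan). Use when asked to run tests, "
    "verify a change, or validate code before a PR."
)

if __name__ == "__main__":
    print(check_description(EXAMPLE))
```

A short label such as "Runs tests" fails the length and "Use when" checks, which matches the kind of vague label the rewrite replaced.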

Content trimming:

Removed the generic "Interpreting Failures" and "Common Failure Patterns" sections - Claude already understands what assertion failures, segfaults, timeouts, and sanitizer errors are. The stellar-core-specific diagnostic guidance (e.g., --ll debug, re-run with ASan) is already covered in the level descriptions themselves

Unchanged:

All domain-specific content preserved: test tag catalog, protocol version flags, tx-meta baseline commands, sanitizer configure sequences, parallel execution patterns, ALWAYS/NEVER guardrails, subagent input/output format

I also stress-tested your running-tests skill against a few real-world task evals, and it held up well on multi-level test progression with the --all-versions protocol matrix and --rng-seed 12345 tx-meta baseline verification. Kudos for that.
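As a rough illustration of the flags mentioned above, the sketch below assembles an argument list without running anything. The helper `build_test_command`, the binary name, the `test` subcommand, and the optional filter argument are assumptions made for this example. Only `--all-versions` and `--rng-seed 12345` come from the text, and whether they are meant to be combined in one invocation is not stated there.

```python
from typing import List, Optional


def build_test_command(
    binary: str = "stellar-core",       # assumed binary name/path
    all_versions: bool = False,
    rng_seed: Optional[int] = None,
    test_filter: Optional[str] = None,  # hypothetical test name or tag filter
) -> List[str]:
    """Build (but do not execute) an argv list for a test run."""
    argv = [binary, "test"]  # "test" subcommand is an assumption
    if all_versions:
        # Flag quoted in the PR text: run across the protocol version matrix.
        argv.append("--all-versions")
    if rng_seed is not None:
        # Flag quoted in the PR text: fixed seed for reproducible tx-meta baselines.
        argv += ["--rng-seed", str(rng_seed)]
    if test_filter:
        argv.append(test_filter)
    return argv


if __name__ == "__main__":
    print(" ".join(build_test_command(all_versions=True, rng_seed=12345)))
```

Building the list separately from executing it makes it easy to log or review the exact command before an agent runs it.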

A quick, honest disclosure: I work at Tessl (https://github.com/tesslio), where we build tooling around skills like these. This is not a pitch; I saw room for improvement and wanted to contribute.

If you want to improve your skills yourself, or define your own scenarios to pressure-test them, ask your agent (Claude Code, Codex, etc.) to evaluate and optimize your skill with Tessl, following the guide at https://docs.tessl.io/evaluate/optimize-a-skill-using-best-practices. Ping me (@yogesh-tessl) if you hit any snags.

@graydon

Checklist

- [x] Reviewed the contributing document (https://github.com/stellar/stellar-core/blob/master/CONTRIBUTING.md#submitting-changes)
- [x] Rebased on top of master (no merge commits)
- [ ] Ran clang-format v8.0.0 (N/A: SKILL.md only, no code changes)
- [ ] Compiles (N/A: SKILL.md only)
- [ ] Ran all tests (N/A: SKILL.md only)
- [ ] If change impacts performance (N/A: SKILL.md only)

Thanks in advance 🙏

Copilot

Pull request overview

This PR improves the running-tests Claude skill metadata so agents can more accurately discover and invoke it for stellar-core test execution workflows, while trimming generic failure interpretation guidance.

Changes:

Expands the skill description with concrete stellar-core/Catch2 testing capabilities and trigger phrases.
Removes generic failure interpretation and common failure-pattern sections.

Copilot AI review requested due to automatic review settings May 14, 2026 03:38

Copilot started reviewing on behalf of yogesh-tessl May 14, 2026 03:38

Copilot AI reviewed May 14, 2026

yogesh-tessl wants to merge 1 commit into stellar:master from yogesh-tessl:improve/skill-review-optimization
