feat: improve running-tests skill score (64% → 90%) #5269

Open
yogesh-tessl wants to merge 1 commit into stellar:master from yogesh-tessl:improve/skill-review-optimization

yogesh-tessl commented May 14, 2026

Hey @graydon 👋

Truly impressive work: 31 skills mapping every subsystem, from SCP consensus to Soroban smart contracts to ledger and bucket management. Breaking each subsystem down into its own reviewable skill shows real care, and the existing `claude-review.yml` with its thorough security model shows you take AI tooling seriously.

# Description

I ran your skills through `tessl skill review` at work and found some targeted improvements for the `running-tests` skill.

Here's the before/after:

| Skill | Before | After | Change |
|-------|--------|-------|--------|
| running-tests | 64% | 90% | +26% |

<details>
<summary>Changes made</summary>

**Description rewrite (biggest impact — 33% → 100%):**
- Expanded from a vague 82-character label to a full description listing concrete capabilities: Catch2 test suite execution, progressive test levels, protocol version matrix, tx-meta baseline checks, and sanitizer builds (ASan/TSan/UBSan)
- Added explicit "Use when..." clause with natural trigger terms ("run tests", "verify a change", "check if tests pass", "run unit tests", "run regression tests", "validate code before a PR")
- Added domain-specific distinctiveness (stellar-core, Catch2) so the skill won't conflict with other testing-related skills

**Content trimming:**
- Removed the generic "Interpreting Failures" and "Common Failure Patterns" sections — Claude already understands what assertion failures, segfaults, timeouts, and sanitizer errors are. The stellar-core-specific diagnostic guidance (e.g., `--ll debug`, re-run with ASan) is already covered in the level descriptions themselves

**Unchanged:**
- All domain-specific content preserved: test tag catalog, protocol version flags, tx-meta baseline commands, sanitizer configure sequences, parallel execution patterns, ALWAYS/NEVER guardrails, subagent input/output format

</details>

I also stress-tested your `running-tests` skill against a few real-world task evals, and it held up well on multi-level test progression with the `--all-versions` protocol matrix and on `--rng-seed 12345` tx-meta baseline verification. Kudos for that.

Honest disclosure: I work at @tesslio, where we build tooling around skills like these. This is not a pitch; I just saw room for improvement and wanted to contribute.

Want to self-improve your skills? Just point your agent (Claude Code, Codex, etc.) at [this Tessl guide](https://docs.tessl.io/evaluate/optimize-a-skill-using-best-practices) and ask it to optimize your skill. Ping me ([@yogesh-tessl](https://github.com/yogesh-tessl)) if you hit any snags.

# Checklist
- [x] Reviewed the [contributing](https://github.com/stellar/stellar-core/blob/master/CONTRIBUTING.md#submitting-changes) document
- [x] Rebased on top of master (no merge commits)
- [ ] ~~Ran `clang-format` v8.0.0~~ (N/A — SKILL.md only, no code changes)
- [ ] ~~Compiles~~ (N/A — SKILL.md only)
- [ ] ~~Ran all tests~~ (N/A — SKILL.md only)
- [ ] ~~If change impacts performance~~ (N/A — SKILL.md only)

Thanks in advance 🙏
Copilot AI review requested due to automatic review settings May 14, 2026 03:38

Copilot AI left a comment

Pull request overview

This PR improves the running-tests Claude skill metadata so agents can more accurately discover and invoke it for stellar-core test execution workflows, while trimming generic failure interpretation guidance.

Changes:

  • Expands the skill description with concrete stellar-core/Catch2 testing capabilities and trigger phrases.
  • Removes generic failure interpretation and common failure-pattern sections.
