feat: improve running-tests skill score (64% → 90%)#5269
Open
yogesh-tessl wants to merge 1 commit into
Hey @graydon 👋

# Description

I ran your skills through `tessl skill review` at work and found some targeted improvements for the `running-tests` skill. Here's the before/after:

| Skill | Before | After | Change |
|-------|--------|-------|--------|
| running-tests | 64% | 90% | +26% |

<details>
<summary>Changes made</summary>

**Description rewrite (biggest impact: 33% → 100%):**
- Expanded from a vague 82-character label to a full description listing concrete capabilities: Catch2 test suite execution, progressive test levels, protocol version matrix, tx-meta baseline checks, and sanitizer builds (ASan/TSan/UBSan)
- Added explicit "Use when..." clause with natural trigger terms ("run tests", "verify a change", "check if tests pass", "run unit tests", "run regression tests", "validate code before a PR")
- Added domain-specific distinctiveness (stellar-core, Catch2) so the skill won't conflict with other testing-related skills

**Content trimming:**
- Removed the generic "Interpreting Failures" and "Common Failure Patterns" sections, since Claude already understands what assertion failures, segfaults, timeouts, and sanitizer errors are. The stellar-core-specific diagnostic guidance (e.g., `--ll debug`, re-run with ASan) is already covered in the level descriptions themselves

**Unchanged:**
- All domain-specific content preserved: test tag catalog, protocol version flags, tx-meta baseline commands, sanitizer configure sequences, parallel execution patterns, ALWAYS/NEVER guardrails, subagent input/output format

</details>

I also stress-tested your `running-tests` skill against a few real-world task evals and it held up really well on multi-level test progression with `--all-versions` protocol matrix and `--rng-seed 12345` tx-meta baseline verification. Kudos for that.

Honest disclosure: I work at @tesslio where we build tooling around skills like these. Not a pitch, just saw room for improvement and wanted to contribute.

Want to self-improve your skills? Just point your agent (Claude Code, Codex, etc.) at [this Tessl guide](https://docs.tessl.io/evaluate/optimize-a-skill-using-best-practices) and ask it to optimize your skill. Ping me ([@yogesh-tessl](https://github.com/yogesh-tessl)) if you hit any snags.

# Checklist

- [x] Reviewed the [contributing](https://github.com/stellar/stellar-core/blob/master/CONTRIBUTING.md#submitting-changes) document
- [x] Rebased on top of master (no merge commits)
- [ ] ~~Ran `clang-format` v8.0.0~~ (N/A: SKILL.md only, no code changes)
- [ ] ~~Compiles~~ (N/A: SKILL.md only)
- [ ] ~~Ran all tests~~ (N/A: SKILL.md only)
- [ ] ~~If change impacts performance~~ (N/A: SKILL.md only)

Thanks in advance 🙏
Pull request overview
This PR improves the running-tests Claude skill metadata so agents can more accurately discover and invoke it for stellar-core test execution workflows, while trimming generic failure interpretation guidance.
Changes:
- Expands the skill description with concrete stellar-core/Catch2 testing capabilities and trigger phrases.
- Removes generic failure interpretation and common failure-pattern sections.
Hey @graydon 👋
Truly impressive work: 31 skills mapping every subsystem, from SCP consensus to Soroban smart contracts to ledger and bucket management. The level of detail in breaking down each subsystem into its own reviewable skill stands out, and the existing claude-review.yml, with its thorough security model, shows you take AI tooling seriously.
If you want to self-improve your skills, or define your own scenarios to pressure-test them, just ask your agent (Claude Code, Codex, etc.) to evaluate and optimize your skill with Tessl. Ping me, @yogesh-tessl, if you hit any snags.