Merged
27 commits
a35b5b3
feat: add /canary, /benchmark, /land-and-deploy skills (v0.7.0)
garrytan Mar 18, 2026
58907e7
feat: add Performance & Bundle Impact category to review checklist
garrytan Mar 18, 2026
3c3be2d
feat: add {{DEPLOY_BOOTSTRAP}} resolver + deployed row in dashboard
garrytan Mar 18, 2026
c46ada1
chore: mark 3 TODOs completed, bump v0.7.0, update CHANGELOG
garrytan Mar 18, 2026
8064116
merge: resolve conflicts with origin/main (v0.6.4.1)
garrytan Mar 18, 2026
0fcf561
merge: resolve conflicts with origin/main (v0.8.2 → v0.9.0)
garrytan Mar 19, 2026
e483c95
merge: resolve conflicts with origin/main (v0.9.0.1 → v0.9.1)
garrytan Mar 20, 2026
198cd2d
feat: /setup-deploy skill + platform-specific deploy verification
garrytan Mar 20, 2026
0d1d2e9
test: E2E + LLM-judge evals for deploy skills
garrytan Mar 20, 2026
17276b3
merge: resolve conflicts with origin/main (v0.9.1.0 → v0.9.1)
garrytan Mar 20, 2026
28deff3
fix: harden E2E tests — server lifecycle, timeouts, preamble budget, …
garrytan Mar 21, 2026
f30150b
test: redesign 6 skipped/todo E2E tests + add test.concurrent support
garrytan Mar 21, 2026
641ea32
fix: relax contributor-mode assertions — test structure not exact phr…
garrytan Mar 21, 2026
a25c7b7
perf: enable test.concurrent for 31 independent E2E tests
garrytan Mar 21, 2026
2c8e8f7
fix: add --concurrent flag to bun test + convert remaining 4 sequenti…
garrytan Mar 21, 2026
2b9b286
perf: split monolithic E2E test into 8 parallel files
garrytan Mar 21, 2026
d68a70d
perf: bump default E2E concurrency to 15
garrytan Mar 21, 2026
d442aad
perf: add model pinning infrastructure + rate-limit telemetry to E2E …
garrytan Mar 21, 2026
ce4a576
fix: resolve 3 E2E test failures — tmpdir race, wasted turns, brittle…
garrytan Mar 21, 2026
fa61e2f
perf: pin quality tests to Opus, add --retry 2 and test:e2e:fast tier
garrytan Mar 21, 2026
2bb6df7
docs: mark E2E model pinning TODO as shipped
garrytan Mar 21, 2026
7352e8e
merge: resolve conflicts with origin/main
garrytan Mar 21, 2026
827f635
docs: add SKILL.md merge conflict directive to CLAUDE.md
garrytan Mar 21, 2026
4c17c01
fix: add DEPLOY_BOOTSTRAP resolver to gen-skill-docs
garrytan Mar 21, 2026
7bed81f
chore: regenerate SKILL.md files after DEPLOY_BOOTSTRAP fix
garrytan Mar 21, 2026
08f4937
fix: move prompt temp file outside workingDirectory to prevent race c…
garrytan Mar 21, 2026
72cc4b7
fix: add --retry 2 --concurrent flags to test:evals scripts for consi…
garrytan Mar 21, 2026
467 changes: 467 additions & 0 deletions .agents/skills/gstack-benchmark/SKILL.md

Large diffs are not rendered by default.

471 changes: 471 additions & 0 deletions .agents/skills/gstack-canary/SKILL.md

Large diffs are not rendered by default.

685 changes: 685 additions & 0 deletions .agents/skills/gstack-land-and-deploy/SKILL.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion .agents/skills/gstack-review/SKILL.md
@@ -335,7 +335,7 @@ Run `git diff origin/<base>` to get the full diff. This includes both committed
Apply the checklist against the diff in two passes:

1. **Pass 1 (CRITICAL):** SQL & Data Safety, Race Conditions & Concurrency, LLM Output Trust Boundary, Enum & Value Completeness
-2. **Pass 2 (INFORMATIONAL):** Conditional Side Effects, Magic Numbers & String Coupling, Dead Code & Consistency, LLM Prompt Issues, Test Gaps, View/Frontend
+2. **Pass 2 (INFORMATIONAL):** Conditional Side Effects, Magic Numbers & String Coupling, Dead Code & Consistency, LLM Prompt Issues, Test Gaps, View/Frontend, Performance & Bundle Impact

**Enum & Value Completeness requires reading code OUTSIDE the diff.** When the diff introduces a new enum value, status, tier, or type constant, use Grep to find all files that reference sibling values, then Read those files to check if the new value is handled. This is the one category where within-diff review is insufficient.

435 changes: 435 additions & 0 deletions .agents/skills/gstack-setup-deploy/SKILL.md

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion ARCHITECTURE.md
@@ -285,7 +285,7 @@ The `parseNDJSON()` function is pure — no I/O, no side effects — making it i
### Observability data flow

```
-skill-e2e.test.ts
+skill-e2e-*.test.ts
│ generates runId, passes testName + runId to each call
21 changes: 20 additions & 1 deletion CLAUDE.md
@@ -63,7 +63,7 @@ gstack/
│ ├── skill-validation.test.ts # Tier 1: static validation (free, <1s)
│ ├── gen-skill-docs.test.ts # Tier 1: generator quality (free, <1s)
│ ├── skill-llm-eval.test.ts # Tier 3: LLM-as-judge (~$0.15/run)
-│ └── skill-e2e.test.ts # Tier 2: E2E via claude -p (~$3.85/run)
+│ └── skill-e2e-*.test.ts # Tier 2: E2E via claude -p (~$3.85/run, split by category)
├── qa-only/ # /qa-only skill (report-only QA, no fixes)
├── plan-design-review/ # /plan-design-review skill (report-only design audit)
├── design-review/ # /design-review skill (design audit + fix loop)
@@ -93,6 +93,12 @@ SKILL.md files are **generated** from `.tmpl` templates. To update docs:
To add a new browse command: add it to `browse/src/commands.ts` and rebuild.
To add a snapshot flag: add it to `SNAPSHOT_FLAGS` in `browse/src/snapshot.ts` and rebuild.

+**Merge conflicts on SKILL.md files:** NEVER resolve conflicts on generated SKILL.md
+files by accepting either side. Instead: (1) resolve conflicts on the `.tmpl` templates
+and `scripts/gen-skill-docs.ts` (the sources of truth), (2) run `bun run gen:skill-docs`
+to regenerate all SKILL.md files, (3) stage the regenerated files. Accepting one side's
+generated output silently drops the other side's template changes.
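
The merge-conflict rule added here amounts to splitting the conflicted paths into sources that are resolved by hand and generated outputs that are only ever regenerated. A minimal TypeScript sketch of that split follows. `planSkillConflictResolution` is a hypothetical helper, not something in the repo, and it assumes a file is generated exactly when its basename is `SKILL.md`.

```typescript
// Hypothetical helper, not part of gstack: turns a list of conflicted paths
// into the ordered steps described in the CLAUDE.md directive.
interface ConflictPlan {
  resolveByHand: string[]; // sources of truth: .tmpl templates, the generator script, other files
  regenerate: string[];    // generated SKILL.md files; never accept either side for these
  steps: string[];         // human-readable steps, in the order they should run
}

function planSkillConflictResolution(conflicted: string[]): ConflictPlan {
  // Assumption: a path is a generated file if and only if its basename is SKILL.md.
  const isGenerated = (p: string): boolean => p.split("/").pop() === "SKILL.md";

  const regenerate = conflicted.filter(isGenerated);
  const resolveByHand = conflicted.filter((p) => !isGenerated(p));

  // Step 1: fix the sources by hand.
  const steps = resolveByHand.map((p) => `resolve conflict markers in ${p}`);
  if (regenerate.length > 0) {
    // Step 2: regenerate every SKILL.md from the resolved sources.
    steps.push("bun run gen:skill-docs");
    // Step 3: stage the regenerated outputs (at minimum the conflicted ones).
    steps.push(`git add ${regenerate.join(" ")}`);
  }
  return { resolveByHand, regenerate, steps };
}
```

The point of the ordering is that the generator only runs after every source conflict is resolved, so no side's template changes are dropped.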

## Platform-agnostic design

Skills must NEVER hardcode framework-specific commands, file patterns, or directory
@@ -227,6 +233,19 @@ regenerated SKILL.md shifts prompt context.

"Pre-existing" without receipts is a lazy claim. Prove it or don't say it.

+## Long-running tasks: don't give up
+
+When running evals, E2E tests, or any long-running background task, **poll until
+completion**. Use `sleep 180 && echo "ready"` + `TaskOutput` in a loop every 3
+minutes. Never switch to blocking mode and give up when the poll times out. Never
+say "I'll be notified when it completes" and stop checking — keep the loop going
+until the task finishes or the user tells you to stop.
+
+The full E2E suite can take 30-45 minutes. That's 10-15 polling cycles. Do all of
+them. Report progress at each check (which tests passed, which are running, any
+failures so far). The user wants to see the run complete, not a promise that
+you'll check later.
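
The polling rule in this section can be read as a loop with exactly two exits: the task finished, or the user asked to stop. A minimal TypeScript sketch is below. The `RunStatus` shape, `pollUntilDone`, and the injected `check`, `wait`, and `report` callbacks are all hypothetical names for illustration; in the workflow described here, `wait` would stand in for the blocking `sleep 180` and `check` for reading `TaskOutput`.

```typescript
// Hypothetical status shape for one progress check of a long-running run.
interface RunStatus {
  done: boolean;
  passed: string[];
  running: string[];
  failed: string[];
}

// Keeps checking until the run is done or the user stops it.
// A timed-out or uneventful check is never a reason to exit the loop.
function pollUntilDone(
  check: () => RunStatus,                 // assumption: reads current task output
  wait: () => void,                       // assumption: blocks for the poll interval (3 minutes in the text)
  report: (line: string) => void,         // progress is reported at every check
  userStopped: () => boolean = () => false,
): RunStatus {
  for (let cycle = 1; ; cycle++) {
    const status = check();
    report(
      `check ${cycle}: ${status.passed.length} passed, ` +
        `${status.running.length} running, ${status.failed.length} failed`,
    );
    if (status.done || userStopped()) return status;
    wait();
  }
}
```

With a 3-minute `wait` and a 30-45 minute suite, this loop would run roughly the 10-15 cycles the text mentions, reporting once per cycle.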

## Deploying to the active skill

The active skill lives at `~/.claude/skills/gstack/`. After making changes:
4 changes: 2 additions & 2 deletions CONTRIBUTING.md
@@ -145,15 +145,15 @@ Spawns `claude -p` as a subprocess with `--output-format stream-json --verbose`,

```bash
# Must run from a plain terminal — can't nest inside Claude Code or Conductor
-EVALS=1 bun test test/skill-e2e.test.ts
+EVALS=1 bun test test/skill-e2e-*.test.ts
```

- Gated by `EVALS=1` env var (prevents accidental expensive runs)
- Auto-skips if running inside Claude Code (`claude -p` can't nest)
- API connectivity pre-check — fails fast on ConnectionRefused before burning budget
- Real-time progress to stderr: `[Ns] turn T tool #C: Name(...)`
- Saves full NDJSON transcripts and failure JSON for debugging
-- Tests live in `test/skill-e2e.test.ts`, runner logic in `test/helpers/session-runner.ts`
+- Tests live in `test/skill-e2e-*.test.ts` (split by category), runner logic in `test/helpers/session-runner.ts`
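
The first two bullets describe a gate that decides whether the expensive E2E tests run at all. A minimal TypeScript sketch of such a gate follows. `shouldRunE2E` is a hypothetical name, and detecting a nested session through a `CLAUDECODE` environment variable is an assumption; the source does not say how the real runner detects nesting.

```typescript
// Hypothetical gate, not the repo's actual implementation.
interface GateDecision {
  run: boolean;
  reason: string;
}

function shouldRunE2E(env: Record<string, string | undefined>): GateDecision {
  // Gate 1 (from the docs): require an explicit opt-in to avoid accidental expensive runs.
  if (env.EVALS !== "1") {
    return { run: false, reason: "EVALS=1 not set" };
  }
  // Gate 2 (from the docs): `claude -p` can't nest inside Claude Code.
  // Assumption: a non-empty CLAUDECODE variable marks a nested session.
  if (env.CLAUDECODE) {
    return { run: false, reason: "running inside Claude Code" };
  }
  return { run: true, reason: "ok" };
}
```

A test file could call `shouldRunE2E(process.env)` once at the top and skip its suite when `run` is false, which keeps the opt-in check in one place.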

### E2E observability

45 changes: 9 additions & 36 deletions TODOS.md
@@ -177,17 +177,6 @@
**Priority:** P2
**Depends on:** None

-### Post-deploy verification (ship + browse)
-
-**What:** After push, browse staging/preview URL, screenshot key pages, check console for JS errors, compare staging vs prod via snapshot diff. Include verification screenshots in PR body. STOP if critical errors found.
-
-**Why:** Catch deployment-time regressions (JS errors, broken layouts) before merge.
-
-**Context:** Requires S3 upload infrastructure for PR screenshots. Pairs with visual PR annotations.
-
-**Effort:** L
-**Priority:** P2
-**Depends on:** /setup-gstack-upload, visual PR annotations

### Visual verification with screenshots in PR body

@@ -348,14 +337,6 @@
**Priority:** P3
**Depends on:** Video recording

-### Deploy-verify skill
-
-**What:** Lightweight post-deploy smoke test: hit key URLs, verify 200s, screenshot critical pages, console error check, compare against baseline snapshots. Pass/fail with evidence.
-
-**Why:** Fast post-deploy confidence check, separate from full QA.
-
-**Effort:** M
-**Priority:** P2

### GitHub Actions eval upload

@@ -369,14 +350,11 @@
**Priority:** P2
**Depends on:** Eval persistence (shipped in v0.3.6)

-### E2E model pinning
-
-**What:** Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.
+### E2E model pinning — SHIPPED

-**Why:** Reduce E2E test cost and flakiness.
+~~**What:** Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.~~

-**Effort:** XS
-**Priority:** P2
+Shipped: Default model changed to Sonnet for structure tests (~30), Opus retained for quality tests (~10). `--retry 2` added. `EVALS_MODEL` env var for override. `test:e2e:fast` tier added. Rate-limit telemetry (first_response_ms, max_inter_turn_ms) and wall_clock_ms tracking added to eval-store.
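
The shipped note describes a two-tier model choice with an environment override. A minimal TypeScript sketch of that selection is below. `pickModel` and `TestKind` are hypothetical names, the Opus identifier is a placeholder because the source does not give one, and letting `EVALS_MODEL` override both tiers is an assumption about how the override behaves.

```typescript
// Hypothetical sketch of the pinning rule; not the repo's actual code.
type TestKind = "structure" | "quality";

const SONNET_MODEL = "claude-sonnet-4-6";       // identifier taken from the TODO text
const OPUS_MODEL = "opus-model-id-placeholder"; // placeholder: the real identifier is not in the source

function pickModel(kind: TestKind, env: Record<string, string | undefined>): string {
  // Assumption: a non-empty EVALS_MODEL overrides the pin for every test kind.
  if (env.EVALS_MODEL) return env.EVALS_MODEL;
  // Quality tests stay on Opus; structure tests default to Sonnet for cost.
  return kind === "quality" ? OPUS_MODEL : SONNET_MODEL;
}
```

Keeping the choice in one function means the roughly 30 structure tests and 10 quality tests only need to declare their kind, and a single `EVALS_MODEL=...` run can compare models without editing tests.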

### Eval web dashboard

@@ -486,17 +464,6 @@ Shipped in v0.8.3. Step 8.5 added to `/ship` — after creating the PR, `/ship`
**Priority:** P3
**Depends on:** gstack-diff-scope (shipped)

-### /merge skill — review-gated PR merge
-
-**What:** Create a `/merge` skill that merges an approved PR, but first checks the Review Readiness Dashboard and runs `/review` (Fix-First) if code review hasn't been done. Separates "ship" (create PR) from "merge" (land it).
-
-**Why:** Currently `/review` runs inside `/ship` Step 3.5 but isn't tracked as a gate. A `/merge` skill ensures code review always happens before landing, and enables workflows where someone else reviews the PR first.
-
-**Context:** `/ship` creates the PR. `/merge` would: check dashboard → run `/review` if needed → `gh pr merge`. This is where code review tracking belongs — at merge time, not at plan time.
-
-**Effort:** M
-**Priority:** P2
-**Depends on:** Ship Confidence Dashboard (shipped)

## Completeness

@@ -548,6 +515,12 @@ Shipped in v0.6.5. TemplateContext in gen-skill-docs.ts bakes skill name into pr

## Completed

+### Deploy pipeline (v0.7.0)
+- /merge skill — review-gated PR merge → superseded by /land-and-deploy
+- Deploy-verify skill → superseded by /land-and-deploy canary verification
+- Post-deploy verification (ship + browse) → superseded by /land-and-deploy
+**Completed:** v0.7.0

### Phase 1: Foundations (v0.2.0)
- Rename to gstack
- Restructure to monorepo layout