garrytan · garrytan · Mar 21, 2026 · Mar 18, 2026 · Mar 18, 2026 · Mar 18, 2026
diff --git a/.agents/skills/gstack-benchmark/SKILL.md b/.agents/skills/gstack-benchmark/SKILL.md
diff --git a/.agents/skills/gstack-canary/SKILL.md b/.agents/skills/gstack-canary/SKILL.md
diff --git a/.agents/skills/gstack-land-and-deploy/SKILL.md b/.agents/skills/gstack-land-and-deploy/SKILL.md
diff --git a/.agents/skills/gstack-review/SKILL.md b/.agents/skills/gstack-review/SKILL.md
@@ -335,7 +335,7 @@ Run `git diff origin/<base>` to get the full diff. This includes both committed
 Apply the checklist against the diff in two passes:
 
 1. **Pass 1 (CRITICAL):** SQL & Data Safety, Race Conditions & Concurrency, LLM Output Trust Boundary, Enum & Value Completeness
-2. **Pass 2 (INFORMATIONAL):** Conditional Side Effects, Magic Numbers & String Coupling, Dead Code & Consistency, LLM Prompt Issues, Test Gaps, View/Frontend
+2. **Pass 2 (INFORMATIONAL):** Conditional Side Effects, Magic Numbers & String Coupling, Dead Code & Consistency, LLM Prompt Issues, Test Gaps, View/Frontend, Performance & Bundle Impact
 
 **Enum & Value Completeness requires reading code OUTSIDE the diff.** When the diff introduces a new enum value, status, tier, or type constant, use Grep to find all files that reference sibling values, then Read those files to check if the new value is handled. This is the one category where within-diff review is insufficient.
 

diff --git a/.agents/skills/gstack-setup-deploy/SKILL.md b/.agents/skills/gstack-setup-deploy/SKILL.md
diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
@@ -285,7 +285,7 @@ The `parseNDJSON()` function is pure — no I/O, no side effects — making it i
 ### Observability data flow
 
 ```
-  skill-e2e.test.ts
+  skill-e2e-*.test.ts
         │
         │ generates runId, passes testName + runId to each call
         │

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -63,7 +63,7 @@ gstack/
 │   ├── skill-validation.test.ts  # Tier 1: static validation (free, <1s)
 │   ├── gen-skill-docs.test.ts    # Tier 1: generator quality (free, <1s)
 │   ├── skill-llm-eval.test.ts   # Tier 3: LLM-as-judge (~$0.15/run)
-│   └── skill-e2e.test.ts         # Tier 2: E2E via claude -p (~$3.85/run)
+│   └── skill-e2e-*.test.ts       # Tier 2: E2E via claude -p (~$3.85/run, split by category)
 ├── qa-only/         # /qa-only skill (report-only QA, no fixes)
 ├── plan-design-review/  # /plan-design-review skill (report-only design audit)
 ├── design-review/    # /design-review skill (design audit + fix loop)
@@ -93,6 +93,12 @@ SKILL.md files are **generated** from `.tmpl` templates. To update docs:
 To add a new browse command: add it to `browse/src/commands.ts` and rebuild.
 To add a snapshot flag: add it to `SNAPSHOT_FLAGS` in `browse/src/snapshot.ts` and rebuild.
 
+**Merge conflicts on SKILL.md files:** NEVER resolve conflicts on generated SKILL.md
+files by accepting either side. Instead: (1) resolve conflicts on the `.tmpl` templates
+and `scripts/gen-skill-docs.ts` (the sources of truth), (2) run `bun run gen:skill-docs`
+to regenerate all SKILL.md files, (3) stage the regenerated files. Accepting one side's
+generated output silently drops the other side's template changes.
+
 ## Platform-agnostic design
 
 Skills must NEVER hardcode framework-specific commands, file patterns, or directory
@@ -227,6 +233,19 @@ regenerated SKILL.md shifts prompt context.
 
 "Pre-existing" without receipts is a lazy claim. Prove it or don't say it.
 
+## Long-running tasks: don't give up
+
+When running evals, E2E tests, or any long-running background task, **poll until
+completion**. Use `sleep 180 && echo "ready"` + `TaskOutput` in a loop every 3
+minutes. Never switch to blocking mode and give up when the poll times out. Never
+say "I'll be notified when it completes" and stop checking — keep the loop going
+until the task finishes or the user tells you to stop.
+
+The full E2E suite can take 30-45 minutes. That's 10-15 polling cycles. Do all of
+them. Report progress at each check (which tests passed, which are running, any
+failures so far). The user wants to see the run complete, not a promise that
+you'll check later.
+
 ## Deploying to the active skill
 
 The active skill lives at `~/.claude/skills/gstack/`. After making changes:

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -145,15 +145,15 @@ Spawns `claude -p` as a subprocess with `--output-format stream-json --verbose`,
 
 ```bash
 # Must run from a plain terminal — can't nest inside Claude Code or Conductor
-EVALS=1 bun test test/skill-e2e.test.ts
+EVALS=1 bun test test/skill-e2e-*.test.ts
 ```
 
 - Gated by `EVALS=1` env var (prevents accidental expensive runs)
 - Auto-skips if running inside Claude Code (`claude -p` can't nest)
 - API connectivity pre-check — fails fast on ConnectionRefused before burning budget
 - Real-time progress to stderr: `[Ns] turn T tool #C: Name(...)`
 - Saves full NDJSON transcripts and failure JSON for debugging
-- Tests live in `test/skill-e2e.test.ts`, runner logic in `test/helpers/session-runner.ts`
+- Tests live in `test/skill-e2e-*.test.ts` (split by category), runner logic in `test/helpers/session-runner.ts`
 
 ### E2E observability
 

diff --git a/TODOS.md b/TODOS.md
@@ -177,17 +177,6 @@
 **Priority:** P2
 **Depends on:** None
 
-### Post-deploy verification (ship + browse)
-
-**What:** After push, browse staging/preview URL, screenshot key pages, check console for JS errors, compare staging vs prod via snapshot diff. Include verification screenshots in PR body. STOP if critical errors found.
-
-**Why:** Catch deployment-time regressions (JS errors, broken layouts) before merge.
-
-**Context:** Requires S3 upload infrastructure for PR screenshots. Pairs with visual PR annotations.
-
-**Effort:** L
-**Priority:** P2
-**Depends on:** /setup-gstack-upload, visual PR annotations
 
 ### Visual verification with screenshots in PR body
 
@@ -348,14 +337,6 @@
 **Priority:** P3
 **Depends on:** Video recording
 
-### Deploy-verify skill
-
-**What:** Lightweight post-deploy smoke test: hit key URLs, verify 200s, screenshot critical pages, console error check, compare against baseline snapshots. Pass/fail with evidence.
-
-**Why:** Fast post-deploy confidence check, separate from full QA.
-
-**Effort:** M
-**Priority:** P2
 
 ### GitHub Actions eval upload
 
@@ -369,14 +350,11 @@
 **Priority:** P2
 **Depends on:** Eval persistence (shipped in v0.3.6)
 
-### E2E model pinning
-
-**What:** Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.
+### E2E model pinning — SHIPPED
 
-**Why:** Reduce E2E test cost and flakiness.
+~~**What:** Pin E2E tests to claude-sonnet-4-6 for cost efficiency, add retry:2 for flaky LLM responses.~~
 
-**Effort:** XS
-**Priority:** P2
+Shipped: Default model changed to Sonnet for structure tests (~30), Opus retained for quality tests (~10). `--retry 2` added. `EVALS_MODEL` env var for override. `test:e2e:fast` tier added. Rate-limit telemetry (first_response_ms, max_inter_turn_ms) and wall_clock_ms tracking added to eval-store.
 
 ### Eval web dashboard
 
@@ -486,17 +464,6 @@ Shipped in v0.8.3. Step 8.5 added to `/ship` — after creating the PR, `/ship`
 **Priority:** P3
 **Depends on:** gstack-diff-scope (shipped)
 
-### /merge skill — review-gated PR merge
-
-**What:** Create a `/merge` skill that merges an approved PR, but first checks the Review Readiness Dashboard and runs `/review` (Fix-First) if code review hasn't been done. Separates "ship" (create PR) from "merge" (land it).
-
-**Why:** Currently `/review` runs inside `/ship` Step 3.5 but isn't tracked as a gate. A `/merge` skill ensures code review always happens before landing, and enables workflows where someone else reviews the PR first.
-
-**Context:** `/ship` creates the PR. `/merge` would: check dashboard → run `/review` if needed → `gh pr merge`. This is where code review tracking belongs — at merge time, not at plan time.
-
-**Effort:** M
-**Priority:** P2
-**Depends on:** Ship Confidence Dashboard (shipped)
 
 ## Completeness
 
@@ -548,6 +515,12 @@ Shipped in v0.6.5. TemplateContext in gen-skill-docs.ts bakes skill name into pr
 
 ## Completed
 
+### Deploy pipeline (v0.7.0)
+- /merge skill — review-gated PR merge → superseded by /land-and-deploy
+- Deploy-verify skill → superseded by /land-and-deploy canary verification
+- Post-deploy verification (ship + browse) → superseded by /land-and-deploy
+**Completed:** v0.7.0
+
 ### Phase 1: Foundations (v0.2.0)
 - Rename to gstack
 - Restructure to monorepo layout