diff --git a/.agents/skills/gstack-office-hours/SKILL.md b/.agents/skills/gstack-office-hours/SKILL.md index c843d908..eb1b7084 100644 --- a/.agents/skills/gstack-office-hours/SKILL.md +++ b/.agents/skills/gstack-office-hours/SKILL.md @@ -218,6 +218,25 @@ success/error/abort, and `USED_BROWSE` with true/false based on whether `$B` was If you cannot determine the outcome, use "unknown". This runs in the background and never blocks the user. +## SETUP (run this check BEFORE any browse command) + +```bash +_ROOT=$(git rev-parse --show-toplevel 2>/dev/null) +B="" +[ -n "$_ROOT" ] && [ -x "$_ROOT/.agents/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.agents/skills/gstack/browse/dist/browse" +[ -z "$B" ] && B=~/.codex/skills/gstack/browse/dist/browse +if [ -x "$B" ]; then + echo "READY: $B" +else + echo "NEEDS_SETUP" +fi +``` + +If `NEEDS_SETUP`: +1. Tell the user: "gstack browse needs a one-time build (~10 seconds). OK to proceed?" Then STOP and wait. +2. Run: `cd && ./setup` +3. If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash` + # YC Office Hours You are a **YC office hours partner**. Your job is to ensure the problem is understood before solutions are proposed. You adapt to what the user is building — startup founders get the hard questions, builders get an enthusiastic collaborator. This skill produces design docs, not code. @@ -482,6 +501,66 @@ Present via AskUserQuestion. Do NOT proceed without user approval of the approac --- +## Visual Sketch (UI ideas only) + +If the chosen approach involves user-facing UI (screens, pages, forms, dashboards, +or interactive elements), generate a rough wireframe to help the user visualize it. +If the idea is backend-only, infrastructure, or has no UI component — skip this +section silently. + +**Step 1: Gather design context** + +1. Check if `DESIGN.md` exists in the repo root. If it does, read it for design + system constraints (colors, typography, spacing, component patterns). 
Use these + constraints in the wireframe. +2. Apply core design principles: + - **Information hierarchy** — what does the user see first, second, third? + - **Interaction states** — loading, empty, error, success, partial + - **Edge case paranoia** — what if the name is 47 chars? Zero results? Network fails? + - **Subtraction default** — "as little design as possible" (Rams). Every element earns its pixels. + - **Design for trust** — every interface element builds or erodes user trust. + +**Step 2: Generate wireframe HTML** + +Generate a single-page HTML file with these constraints: +- **Intentionally rough aesthetic** — use system fonts, thin gray borders, no color, + hand-drawn-style elements. This is a sketch, not a polished mockup. +- Self-contained — no external dependencies, no CDN links, inline CSS only +- Show the core interaction flow (1-3 screens/states max) +- Include realistic placeholder content (not "Lorem ipsum" — use content that + matches the actual use case) +- Add HTML comments explaining design decisions + +Write to a temp file: +```bash +SKETCH_FILE="/tmp/gstack-sketch-$(date +%s).html" +``` + +**Step 3: Render and capture** + +```bash +$B goto "file://$SKETCH_FILE" +$B screenshot /tmp/gstack-sketch.png +``` + +If `$B` is not available (browse binary not set up), skip the render step. Tell the +user: "Visual sketch requires the browse binary. Run the setup script to enable it." + +**Step 4: Present and iterate** + +Show the screenshot to the user. Ask: "Does this feel right? Want to iterate on the layout?" + +If they want changes, regenerate the HTML with their feedback and re-render. +If they approve or say "good enough," proceed. + +**Step 5: Include in design doc** + +Reference the wireframe screenshot in the design doc's "Recommended Approach" section. +The screenshot file at `/tmp/gstack-sketch.png` can be referenced by downstream skills +(`/plan-design-review`, `/design-review`) to see what was originally envisioned. 
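The render step and its fallback in Step 3 can be sketched as one guarded function. This is a minimal sketch, not the skill's actual implementation: the function name `render_sketch` is hypothetical, and it assumes `$B` and `$SKETCH_FILE` were set by the SETUP check and Step 2. The `goto` and `screenshot` subcommands are the ones shown in Step 3. The guard uses `-x` rather than only `-n` because the SETUP check assigns a fallback path to `$B` even when no binary exists there.

```bash
# Hypothetical helper (name is an assumption): render the wireframe if the
# browse binary is usable, otherwise print the fallback message from Step 3.
render_sketch() {
  # SETUP may leave $B pointing at a path that does not exist, so test -x too.
  if [ -n "${B:-}" ] && [ -x "${B:-}" ]; then
    "$B" goto "file://$SKETCH_FILE" &&
      "$B" screenshot /tmp/gstack-sketch.png
  else
    echo "Visual sketch requires the browse binary. Run the setup script to enable it."
  fi
}
```

Calling `render_sketch` after writing the HTML keeps the skip path and the render path in one place, so a missing binary never aborts the session.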
+ +--- + ## Phase 4.5: Founder Signal Synthesis Before writing the design doc, synthesize the founder signals you observed during the session. These will appear in the design doc ("What I noticed") and in the closing conversation (Phase 6). @@ -618,7 +697,73 @@ Supersedes: {prior filename — omit this line if first design on this branch} {observational, mentor-like reflections referencing specific things the user said during the session. Quote their words back to them — don't characterize their behavior. 2-4 bullets.} ``` -Present the design doc to the user via AskUserQuestion: +--- + +## Spec Review Loop + +Before presenting the document to the user for approval, run an adversarial review. + +**Step 1: Dispatch reviewer subagent** + +Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context +and cannot see the brainstorming conversation — only the document. This ensures genuine +adversarial independence. + +Prompt the subagent with: +- The file path of the document just written +- "Read this document and review it on 5 dimensions. For each dimension, note PASS or + list specific issues with suggested fixes. At the end, output a quality score (1-10) + across all dimensions." + +**Dimensions:** +1. **Completeness** — Are all requirements addressed? Missing edge cases? +2. **Consistency** — Do parts of the document agree with each other? Contradictions? +3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language? +4. **Scope** — Does the document creep beyond the original problem? YAGNI violations? +5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity? + +The subagent should return: +- A quality score (1-10) +- PASS if no issues, or a numbered list of issues with dimension, description, and fix + +**Step 2: Fix and re-dispatch** + +If the reviewer returns issues: +1. Fix each issue in the document on disk (use Edit tool) +2. 
Re-dispatch the reviewer subagent with the updated document +3. Maximum 3 iterations total + +**Convergence guard:** If the reviewer returns the same issues on consecutive iterations +(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop +and persist those issues as "Reviewer Concerns" in the document rather than looping +further. + +If the subagent fails, times out, or is unavailable — skip the review loop entirely. +Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is +already written to disk; the review is a quality bonus, not a gate. + +**Step 3: Report and persist metrics** + +After the loop completes (PASS, max iterations, or convergence guard): + +1. Tell the user the result — summary by default: + "Your doc survived N rounds of adversarial review. M issues caught and fixed. + Quality score: X/10." + If they ask "what did the reviewer find?", show the full reviewer output. + +2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns" + section to the document listing each unresolved issue. Downstream skills will see this. + +3. Append metrics: +```bash +mkdir -p ~/.gstack/analytics +echo '{"skill":"office-hours","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true +``` +Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review. 
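As an illustration of the substitution described above, here is a minimal sketch of the metrics append with the placeholders filled in. The numeric values are invented for the example, and the log goes to a scratch directory rather than `~/.gstack/analytics` so the sketch can be run without touching real analytics. It uses `printf` instead of the `echo` shown in the skill, which is a swapped-in choice and not what the skill itself does.

```bash
# Hypothetical values from a finished review loop (assumptions for illustration).
ITERATIONS=2
FOUND=5
FIXED=4
REMAINING=1
SCORE=8

# Scratch location for the example; the skill appends to ~/.gstack/analytics instead.
ANALYTICS_DIR="$(mktemp -d)"
LOG="$ANALYTICS_DIR/spec-review.jsonl"

# printf keeps the quoting in one place and emits exactly one JSON object per line.
printf '{"skill":"%s","ts":"%s","iterations":%d,"issues_found":%d,"issues_fixed":%d,"remaining":%d,"quality_score":%d}\n' \
  "office-hours" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  "$ITERATIONS" "$FOUND" "$FIXED" "$REMAINING" "$SCORE" >> "$LOG"

# Count the lines written so far (tr strips the padding some wc builds add).
LINE_COUNT=$(wc -l < "$LOG" | tr -d ' ')
```

Because each run appends one line, the file stays valid JSONL as long as every substituted value is a bare number.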
+ +--- + +Present the reviewed design doc to the user via AskUserQuestion: - A) Approve — mark Status: APPROVED and proceed to handoff - B) Revise — specify which sections need changes (loop back to revise those sections) - C) Start over — return to Phase 2 diff --git a/.agents/skills/gstack-plan-ceo-review/SKILL.md b/.agents/skills/gstack-plan-ceo-review/SKILL.md index dfb1c937..6d078f3a 100644 --- a/.agents/skills/gstack-plan-ceo-review/SKILL.md +++ b/.agents/skills/gstack-plan-ceo-review/SKILL.md @@ -324,6 +324,37 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head ``` If a design doc exists (from `/office-hours`), read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design. +## Prerequisite Skill Offer + +When the design doc check above prints "No design doc found," offer the prerequisite +skill before proceeding. + +Say to the user via AskUserQuestion: + +> "No design doc found for this branch. `/office-hours` produces a structured problem +> statement, premise challenge, and explored alternatives — it gives this review much +> sharper input to work with. Takes about 10 minutes. The design doc is per-feature, +> not per-product — it captures the thinking behind this specific change." + +Options: +- A) Run /office-hours first (in another window, then come back) +- B) Skip — proceed with standard review + +If they skip: "No worries — standard review. If you ever want sharper input, try +/office-hours first next time." Then proceed normally. Do not re-offer later in the session. 
+ +**Mid-session detection:** During Step 0A (Premise Challenge), if the user can't +articulate the problem, keeps changing the problem statement, answers with "I'm not +sure," or is clearly exploring rather than reviewing — offer `/office-hours`: + +> "It sounds like you're still figuring out what to build — that's totally fine, but +> that's what /office-hours is designed for. Want to pause this review and run +> /office-hours first? It'll help you nail down the problem and approach, then come +> back here for the strategic review." + +Options: A) Yes, run /office-hours first. B) No, keep going. +If they keep going, proceed normally — no guilt, no re-asking. + When reading TODOS.md, specifically: * Note any TODOs this plan touches, blocks, or unlocks * Check if deferred work from prior reviews relates to this plan @@ -467,6 +498,70 @@ Repo: {owner/repo} Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format. +After writing the CEO plan, run the spec review loop on it: + +## Spec Review Loop + +Before presenting the document to the user for approval, run an adversarial review. + +**Step 1: Dispatch reviewer subagent** + +Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context +and cannot see the brainstorming conversation — only the document. This ensures genuine +adversarial independence. + +Prompt the subagent with: +- The file path of the document just written +- "Read this document and review it on 5 dimensions. For each dimension, note PASS or + list specific issues with suggested fixes. At the end, output a quality score (1-10) + across all dimensions." + +**Dimensions:** +1. **Completeness** — Are all requirements addressed? Missing edge cases? +2. **Consistency** — Do parts of the document agree with each other? Contradictions? +3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language? +4. 
**Scope** — Does the document creep beyond the original problem? YAGNI violations? +5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity? + +The subagent should return: +- A quality score (1-10) +- PASS if no issues, or a numbered list of issues with dimension, description, and fix + +**Step 2: Fix and re-dispatch** + +If the reviewer returns issues: +1. Fix each issue in the document on disk (use Edit tool) +2. Re-dispatch the reviewer subagent with the updated document +3. Maximum 3 iterations total + +**Convergence guard:** If the reviewer returns the same issues on consecutive iterations +(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop +and persist those issues as "Reviewer Concerns" in the document rather than looping +further. + +If the subagent fails, times out, or is unavailable — skip the review loop entirely. +Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is +already written to disk; the review is a quality bonus, not a gate. + +**Step 3: Report and persist metrics** + +After the loop completes (PASS, max iterations, or convergence guard): + +1. Tell the user the result — summary by default: + "Your doc survived N rounds of adversarial review. M issues caught and fixed. + Quality score: X/10." + If they ask "what did the reviewer find?", show the full reviewer output. + +2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns" + section to the document listing each unresolved issue. Downstream skills will see this. + +3. Append metrics: +```bash +mkdir -p ~/.gstack/analytics +echo '{"skill":"plan-ceo-review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true +``` +Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review. 
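The loop control from Step 2 (the iteration cap plus the convergence guard) can be sketched in a few lines of shell. This is a sketch under stated assumptions, not the skill's mechanism: `run_reviewer` is a hypothetical stub standing in for the reviewer subagent, it always reports the same issue so the guard fires, and "3 iterations" is read here as at most three reviewer dispatches.

```bash
# Hypothetical stub for the reviewer subagent (assumption for illustration):
# prints one issue per line, or nothing for PASS. Here it never changes its mind.
run_reviewer() {
  printf 'scope: extra feature beyond the original problem\n'
}

MAX_ITERATIONS=3
PREV_ISSUES=""
OUTCOME="max_iterations"
ITERATIONS=0

while [ "$ITERATIONS" -lt "$MAX_ITERATIONS" ]; do
  ITERATIONS=$((ITERATIONS + 1))
  ISSUES="$(run_reviewer)"

  # No issues means the reviewer returned PASS.
  if [ -z "$ISSUES" ]; then
    OUTCOME="pass"
    break
  fi

  # Convergence guard: the same issues on consecutive rounds means the fixes
  # are not landing, so stop and record them as Reviewer Concerns.
  if [ "$ISSUES" = "$PREV_ISSUES" ]; then
    OUTCOME="converged"
    break
  fi

  PREV_ISSUES="$ISSUES"
  # The skill would edit the document on disk here before re-dispatching.
done

RESULT="$OUTCOME after $ITERATIONS iterations"
echo "$RESULT"
```

With this stub the loop stops on the second round, since the reviewer repeats itself. A stub that prints nothing would exit with `pass` on the first round.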
+ ### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes) Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan? ``` diff --git a/.agents/skills/gstack-plan-eng-review/SKILL.md b/.agents/skills/gstack-plan-eng-review/SKILL.md index 492bf9f8..d4cff7cd 100644 --- a/.agents/skills/gstack-plan-eng-review/SKILL.md +++ b/.agents/skills/gstack-plan-eng-review/SKILL.md @@ -269,6 +269,25 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head ``` If a design doc exists, read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design — check the prior version for context on what changed and why. +## Prerequisite Skill Offer + +When the design doc check above prints "No design doc found," offer the prerequisite +skill before proceeding. + +Say to the user via AskUserQuestion: + +> "No design doc found for this branch. `/office-hours` produces a structured problem +> statement, premise challenge, and explored alternatives — it gives this review much +> sharper input to work with. Takes about 10 minutes. The design doc is per-feature, +> not per-product — it captures the thinking behind this specific change." + +Options: +- A) Run /office-hours first (in another window, then come back) +- B) Skip — proceed with standard review + +If they skip: "No worries — standard review. If you ever want sharper input, try +/office-hours first next time." Then proceed normally. Do not re-offer later in the session. + ### Step 0: Scope Challenge Before reviewing anything, answer these questions: 1. **What existing code already partially or fully solves each sub-problem?** Can we capture outputs from existing flows rather than building parallel ones? 
diff --git a/CHANGELOG.md b/CHANGELOG.md index 74572e22..e0259c60 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,5 +1,14 @@ # Changelog +## [0.9.1.0] - 2026-03-20 — Adversarial Spec Review + Skill Chaining + +### Added + +- **Your design docs now get stress-tested before you see them.** When you run `/office-hours`, an independent AI reviewer checks your design doc for completeness, consistency, clarity, scope creep, and feasibility — up to 3 rounds. You get a quality score (1-10) and a summary of what was caught and fixed. The doc you approve has already survived adversarial review. +- **Visual wireframes during brainstorming.** For UI ideas, `/office-hours` now generates a rough HTML wireframe using your project's design system (from DESIGN.md) and screenshots it. You see what you're designing while you're still thinking, not after you've coded it. +- **Skills help each other now.** `/plan-ceo-review` and `/plan-eng-review` detect when you'd benefit from running `/office-hours` first and offer it — one-tap to switch, one-tap to decline. If you seem lost during a CEO review, it'll gently suggest brainstorming first. +- **Spec review metrics.** Every adversarial review logs iterations, issues found/fixed, and quality score to `~/.gstack/analytics/spec-review.jsonl`. Over time, you can see if your design docs are getting better. + ## [0.9.0.1] - 2026-03-19 ### Changed diff --git a/VERSION b/VERSION index 15e36e66..cf94a424 100644 --- a/VERSION +++ b/VERSION @@ -1 +1 @@ -0.9.0.1 +0.9.1.0 diff --git a/office-hours/SKILL.md b/office-hours/SKILL.md index ff0aeafa..2a2e7583 100644 --- a/office-hours/SKILL.md +++ b/office-hours/SKILL.md @@ -227,6 +227,25 @@ success/error/abort, and `USED_BROWSE` with true/false based on whether `$B` was If you cannot determine the outcome, use "unknown". This runs in the background and never blocks the user. 
+## SETUP (run this check BEFORE any browse command) + +```bash +_ROOT=$(git rev-parse --show-toplevel 2>/dev/null) +B="" +[ -n "$_ROOT" ] && [ -x "$_ROOT/.claude/skills/gstack/browse/dist/browse" ] && B="$_ROOT/.claude/skills/gstack/browse/dist/browse" +[ -z "$B" ] && B=~/.claude/skills/gstack/browse/dist/browse +if [ -x "$B" ]; then + echo "READY: $B" +else + echo "NEEDS_SETUP" +fi +``` + +If `NEEDS_SETUP`: +1. Tell the user: "gstack browse needs a one-time build (~10 seconds). OK to proceed?" Then STOP and wait. +2. Run: `cd && ./setup` +3. If `bun` is not installed: `curl -fsSL https://bun.sh/install | bash` + # YC Office Hours You are a **YC office hours partner**. Your job is to ensure the problem is understood before solutions are proposed. You adapt to what the user is building — startup founders get the hard questions, builders get an enthusiastic collaborator. This skill produces design docs, not code. @@ -491,6 +510,66 @@ Present via AskUserQuestion. Do NOT proceed without user approval of the approac --- +## Visual Sketch (UI ideas only) + +If the chosen approach involves user-facing UI (screens, pages, forms, dashboards, +or interactive elements), generate a rough wireframe to help the user visualize it. +If the idea is backend-only, infrastructure, or has no UI component — skip this +section silently. + +**Step 1: Gather design context** + +1. Check if `DESIGN.md` exists in the repo root. If it does, read it for design + system constraints (colors, typography, spacing, component patterns). Use these + constraints in the wireframe. +2. Apply core design principles: + - **Information hierarchy** — what does the user see first, second, third? + - **Interaction states** — loading, empty, error, success, partial + - **Edge case paranoia** — what if the name is 47 chars? Zero results? Network fails? + - **Subtraction default** — "as little design as possible" (Rams). Every element earns its pixels. 
+ - **Design for trust** — every interface element builds or erodes user trust. + +**Step 2: Generate wireframe HTML** + +Generate a single-page HTML file with these constraints: +- **Intentionally rough aesthetic** — use system fonts, thin gray borders, no color, + hand-drawn-style elements. This is a sketch, not a polished mockup. +- Self-contained — no external dependencies, no CDN links, inline CSS only +- Show the core interaction flow (1-3 screens/states max) +- Include realistic placeholder content (not "Lorem ipsum" — use content that + matches the actual use case) +- Add HTML comments explaining design decisions + +Write to a temp file: +```bash +SKETCH_FILE="/tmp/gstack-sketch-$(date +%s).html" +``` + +**Step 3: Render and capture** + +```bash +$B goto "file://$SKETCH_FILE" +$B screenshot /tmp/gstack-sketch.png +``` + +If `$B` is not available (browse binary not set up), skip the render step. Tell the +user: "Visual sketch requires the browse binary. Run the setup script to enable it." + +**Step 4: Present and iterate** + +Show the screenshot to the user. Ask: "Does this feel right? Want to iterate on the layout?" + +If they want changes, regenerate the HTML with their feedback and re-render. +If they approve or say "good enough," proceed. + +**Step 5: Include in design doc** + +Reference the wireframe screenshot in the design doc's "Recommended Approach" section. +The screenshot file at `/tmp/gstack-sketch.png` can be referenced by downstream skills +(`/plan-design-review`, `/design-review`) to see what was originally envisioned. + +--- + ## Phase 4.5: Founder Signal Synthesis Before writing the design doc, synthesize the founder signals you observed during the session. These will appear in the design doc ("What I noticed") and in the closing conversation (Phase 6). 
@@ -627,7 +706,73 @@ Supersedes: {prior filename — omit this line if first design on this branch} {observational, mentor-like reflections referencing specific things the user said during the session. Quote their words back to them — don't characterize their behavior. 2-4 bullets.} ``` -Present the design doc to the user via AskUserQuestion: +--- + +## Spec Review Loop + +Before presenting the document to the user for approval, run an adversarial review. + +**Step 1: Dispatch reviewer subagent** + +Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context +and cannot see the brainstorming conversation — only the document. This ensures genuine +adversarial independence. + +Prompt the subagent with: +- The file path of the document just written +- "Read this document and review it on 5 dimensions. For each dimension, note PASS or + list specific issues with suggested fixes. At the end, output a quality score (1-10) + across all dimensions." + +**Dimensions:** +1. **Completeness** — Are all requirements addressed? Missing edge cases? +2. **Consistency** — Do parts of the document agree with each other? Contradictions? +3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language? +4. **Scope** — Does the document creep beyond the original problem? YAGNI violations? +5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity? + +The subagent should return: +- A quality score (1-10) +- PASS if no issues, or a numbered list of issues with dimension, description, and fix + +**Step 2: Fix and re-dispatch** + +If the reviewer returns issues: +1. Fix each issue in the document on disk (use Edit tool) +2. Re-dispatch the reviewer subagent with the updated document +3. 
Maximum 3 iterations total + +**Convergence guard:** If the reviewer returns the same issues on consecutive iterations +(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop +and persist those issues as "Reviewer Concerns" in the document rather than looping +further. + +If the subagent fails, times out, or is unavailable — skip the review loop entirely. +Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is +already written to disk; the review is a quality bonus, not a gate. + +**Step 3: Report and persist metrics** + +After the loop completes (PASS, max iterations, or convergence guard): + +1. Tell the user the result — summary by default: + "Your doc survived N rounds of adversarial review. M issues caught and fixed. + Quality score: X/10." + If they ask "what did the reviewer find?", show the full reviewer output. + +2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns" + section to the document listing each unresolved issue. Downstream skills will see this. + +3. Append metrics: +```bash +mkdir -p ~/.gstack/analytics +echo '{"skill":"office-hours","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true +``` +Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review. 
+ +--- + +Present the reviewed design doc to the user via AskUserQuestion: - A) Approve — mark Status: APPROVED and proceed to handoff - B) Revise — specify which sections need changes (loop back to revise those sections) - C) Start over — return to Phase 2 diff --git a/office-hours/SKILL.md.tmpl b/office-hours/SKILL.md.tmpl index caf91acb..e0ff98a7 100644 --- a/office-hours/SKILL.md.tmpl +++ b/office-hours/SKILL.md.tmpl @@ -23,6 +23,8 @@ allowed-tools: {{PREAMBLE}} +{{BROWSE_SETUP}} + # YC Office Hours You are a **YC office hours partner**. Your job is to ensure the problem is understood before solutions are proposed. You adapt to what the user is building — startup founders get the hard questions, builders get an enthusiastic collaborator. This skill produces design docs, not code. @@ -287,6 +289,10 @@ Present via AskUserQuestion. Do NOT proceed without user approval of the approac --- +{{DESIGN_SKETCH}} + +--- + ## Phase 4.5: Founder Signal Synthesis Before writing the design doc, synthesize the founder signals you observed during the session. These will appear in the design doc ("What I noticed") and in the closing conversation (Phase 6). @@ -423,7 +429,13 @@ Supersedes: {prior filename — omit this line if first design on this branch} {observational, mentor-like reflections referencing specific things the user said during the session. Quote their words back to them — don't characterize their behavior. 
2-4 bullets.} ``` -Present the design doc to the user via AskUserQuestion: +--- + +{{SPEC_REVIEW_LOOP}} + +--- + +Present the reviewed design doc to the user via AskUserQuestion: - A) Approve — mark Status: APPROVED and proceed to handoff - B) Revise — specify which sections need changes (loop back to revise those sections) - C) Start over — return to Phase 2 diff --git a/plan-ceo-review/SKILL.md b/plan-ceo-review/SKILL.md index 68b234d6..44fc4013 100644 --- a/plan-ceo-review/SKILL.md +++ b/plan-ceo-review/SKILL.md @@ -10,6 +10,7 @@ description: | or "is this ambitious enough". Proactively suggest when the user is questioning scope or ambition of a plan, or when the plan feels like it could be thinking bigger. +benefits-from: [office-hours] allowed-tools: - Read - Grep @@ -331,6 +332,37 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head ``` If a design doc exists (from `/office-hours`), read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design. +## Prerequisite Skill Offer + +When the design doc check above prints "No design doc found," offer the prerequisite +skill before proceeding. + +Say to the user via AskUserQuestion: + +> "No design doc found for this branch. `/office-hours` produces a structured problem +> statement, premise challenge, and explored alternatives — it gives this review much +> sharper input to work with. Takes about 10 minutes. The design doc is per-feature, +> not per-product — it captures the thinking behind this specific change." + +Options: +- A) Run /office-hours first (in another window, then come back) +- B) Skip — proceed with standard review + +If they skip: "No worries — standard review. If you ever want sharper input, try +/office-hours first next time." Then proceed normally. Do not re-offer later in the session. 
+ +**Mid-session detection:** During Step 0A (Premise Challenge), if the user can't +articulate the problem, keeps changing the problem statement, answers with "I'm not +sure," or is clearly exploring rather than reviewing — offer `/office-hours`: + +> "It sounds like you're still figuring out what to build — that's totally fine, but +> that's what /office-hours is designed for. Want to pause this review and run +> /office-hours first? It'll help you nail down the problem and approach, then come +> back here for the strategic review." + +Options: A) Yes, run /office-hours first. B) No, keep going. +If they keep going, proceed normally — no guilt, no re-asking. + When reading TODOS.md, specifically: * Note any TODOs this plan touches, blocks, or unlocks * Check if deferred work from prior reviews relates to this plan @@ -474,6 +506,70 @@ Repo: {owner/repo} Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format. +After writing the CEO plan, run the spec review loop on it: + +## Spec Review Loop + +Before presenting the document to the user for approval, run an adversarial review. + +**Step 1: Dispatch reviewer subagent** + +Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context +and cannot see the brainstorming conversation — only the document. This ensures genuine +adversarial independence. + +Prompt the subagent with: +- The file path of the document just written +- "Read this document and review it on 5 dimensions. For each dimension, note PASS or + list specific issues with suggested fixes. At the end, output a quality score (1-10) + across all dimensions." + +**Dimensions:** +1. **Completeness** — Are all requirements addressed? Missing edge cases? +2. **Consistency** — Do parts of the document agree with each other? Contradictions? +3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language? +4. 
**Scope** — Does the document creep beyond the original problem? YAGNI violations? +5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity? + +The subagent should return: +- A quality score (1-10) +- PASS if no issues, or a numbered list of issues with dimension, description, and fix + +**Step 2: Fix and re-dispatch** + +If the reviewer returns issues: +1. Fix each issue in the document on disk (use Edit tool) +2. Re-dispatch the reviewer subagent with the updated document +3. Maximum 3 iterations total + +**Convergence guard:** If the reviewer returns the same issues on consecutive iterations +(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop +and persist those issues as "Reviewer Concerns" in the document rather than looping +further. + +If the subagent fails, times out, or is unavailable — skip the review loop entirely. +Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is +already written to disk; the review is a quality bonus, not a gate. + +**Step 3: Report and persist metrics** + +After the loop completes (PASS, max iterations, or convergence guard): + +1. Tell the user the result — summary by default: + "Your doc survived N rounds of adversarial review. M issues caught and fixed. + Quality score: X/10." + If they ask "what did the reviewer find?", show the full reviewer output. + +2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns" + section to the document listing each unresolved issue. Downstream skills will see this. + +3. Append metrics: +```bash +mkdir -p ~/.gstack/analytics +echo '{"skill":"plan-ceo-review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true +``` +Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review. 
+ ### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes) Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan? ``` diff --git a/plan-ceo-review/SKILL.md.tmpl b/plan-ceo-review/SKILL.md.tmpl index 87dec8e7..8dce40eb 100644 --- a/plan-ceo-review/SKILL.md.tmpl +++ b/plan-ceo-review/SKILL.md.tmpl @@ -10,6 +10,7 @@ description: | or "is this ambitious enough". Proactively suggest when the user is questioning scope or ambition of a plan, or when the plan feels like it could be thinking bigger. +benefits-from: [office-hours] allowed-tools: - Read - Grep @@ -110,6 +111,20 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head ``` If a design doc exists (from `/office-hours`), read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design. +{{BENEFITS_FROM}} + +**Mid-session detection:** During Step 0A (Premise Challenge), if the user can't +articulate the problem, keeps changing the problem statement, answers with "I'm not +sure," or is clearly exploring rather than reviewing — offer `/office-hours`: + +> "It sounds like you're still figuring out what to build — that's totally fine, but +> that's what /office-hours is designed for. Want to pause this review and run +> /office-hours first? It'll help you nail down the problem and approach, then come +> back here for the strategic review." + +Options: A) Yes, run /office-hours first. B) No, keep going. +If they keep going, proceed normally — no guilt, no re-asking. + When reading TODOS.md, specifically: * Note any TODOs this plan touches, blocks, or unlocks * Check if deferred work from prior reviews relates to this plan @@ -253,6 +268,10 @@ Repo: {owner/repo} Derive the feature slug from the plan being reviewed (e.g., "user-dashboard", "auth-refactor"). Use the date in YYYY-MM-DD format. 
+After writing the CEO plan, run the spec review loop on it: + +{{SPEC_REVIEW_LOOP}} + ### 0E. Temporal Interrogation (EXPANSION, SELECTIVE EXPANSION, and HOLD modes) Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan? ``` diff --git a/plan-eng-review/SKILL.md b/plan-eng-review/SKILL.md index 45ac15d0..078a2875 100644 --- a/plan-eng-review/SKILL.md +++ b/plan-eng-review/SKILL.md @@ -8,6 +8,7 @@ description: | "review the architecture", "engineering review", or "lock in the plan". Proactively suggest when the user has a plan or design doc and is about to start coding — to catch architecture issues before implementation. +benefits-from: [office-hours] allowed-tools: - Read - Write @@ -277,6 +278,25 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head ``` If a design doc exists, read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design — check the prior version for context on what changed and why. +## Prerequisite Skill Offer + +When the design doc check above prints "No design doc found," offer the prerequisite +skill before proceeding. + +Say to the user via AskUserQuestion: + +> "No design doc found for this branch. `/office-hours` produces a structured problem +> statement, premise challenge, and explored alternatives — it gives this review much +> sharper input to work with. Takes about 10 minutes. The design doc is per-feature, +> not per-product — it captures the thinking behind this specific change." + +Options: +- A) Run /office-hours first (in another window, then come back) +- B) Skip — proceed with standard review + +If they skip: "No worries — standard review. If you ever want sharper input, try +/office-hours first next time." Then proceed normally. Do not re-offer later in the session. 
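The offer text above comes from the `benefits-from: [office-hours]` frontmatter line. A minimal sketch of that path, assuming the same regular expression and `' or '` join that the generator script in this diff uses; the wrapper names `parseBenefitsFrom` and `renderSkillList` are hypothetical and exist only for illustration.

```typescript
// Hypothetical helper: extracts an inline YAML list such as
// `benefits-from: [a, b]` from template frontmatter.
function parseBenefitsFrom(tmplContent: string): string[] | undefined {
  const match = tmplContent.match(/^benefits-from:\s*\[([^\]]*)\]/m);
  if (!match) return undefined; // field absent, or not written as an inline list
  return (match[1] ?? "")
    .split(",")
    .map((s) => s.trim())
    .filter(Boolean); // drops empty entries, so `[]` yields an empty array
}

// Hypothetical helper: formats skill names the way the offer text shows them.
function renderSkillList(skills: string[]): string {
  return skills.map((s) => "`/" + s + "`").join(" or ");
}

const frontmatter = "name: plan-eng-review\nbenefits-from: [office-hours]\n";
const skills = parseBenefitsFrom(frontmatter) ?? [];
const offerSubject = renderSkillList(skills);
```

One consequence of this pattern is that a block-style YAML list (one `- item` per line) would not match, so such a template would get no offer section. Keeping the field inline, as both templates in this diff do, avoids that.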
+ ### Step 0: Scope Challenge Before reviewing anything, answer these questions: 1. **What existing code already partially or fully solves each sub-problem?** Can we capture outputs from existing flows rather than building parallel ones? diff --git a/plan-eng-review/SKILL.md.tmpl b/plan-eng-review/SKILL.md.tmpl index ef21a200..09782a9d 100644 --- a/plan-eng-review/SKILL.md.tmpl +++ b/plan-eng-review/SKILL.md.tmpl @@ -8,6 +8,7 @@ description: | "review the architecture", "engineering review", or "lock in the plan". Proactively suggest when the user has a plan or design doc and is about to start coding — to catch architecture issues before implementation. +benefits-from: [office-hours] allowed-tools: - Read - Write @@ -73,6 +74,8 @@ DESIGN=$(ls -t ~/.gstack/projects/$SLUG/*-$BRANCH-design-*.md 2>/dev/null | head ``` If a design doc exists, read it. Use it as the source of truth for the problem statement, constraints, and chosen approach. If it has a `Supersedes:` field, note that this is a revised design — check the prior version for context on what changed and why. +{{BENEFITS_FROM}} + ### Step 0: Scope Challenge Before reviewing anything, answer these questions: 1. **What existing code already partially or fully solves each sub-problem?** Can we capture outputs from existing flows rather than building parallel ones? diff --git a/scripts/gen-skill-docs.ts b/scripts/gen-skill-docs.ts index 8ac36a46..53e8834f 100644 --- a/scripts/gen-skill-docs.ts +++ b/scripts/gen-skill-docs.ts @@ -55,6 +55,7 @@ const HOST_PATHS: Record = { interface TemplateContext { skillName: string; tmplPath: string; + benefitsFrom?: string[]; host: Host; paths: HostPaths; } @@ -1261,6 +1262,156 @@ Only commit if there are changes. Stage all bootstrap files (config, test direct ---`; } +function generateSpecReviewLoop(_ctx: TemplateContext): string { + return `## Spec Review Loop + +Before presenting the document to the user for approval, run an adversarial review. 
+ +**Step 1: Dispatch reviewer subagent** + +Use the Agent tool to dispatch an independent reviewer. The reviewer has fresh context +and cannot see the brainstorming conversation — only the document. This ensures genuine +adversarial independence. + +Prompt the subagent with: +- The file path of the document just written +- "Read this document and review it on 5 dimensions. For each dimension, note PASS or + list specific issues with suggested fixes. At the end, output a quality score (1-10) + across all dimensions." + +**Dimensions:** +1. **Completeness** — Are all requirements addressed? Missing edge cases? +2. **Consistency** — Do parts of the document agree with each other? Contradictions? +3. **Clarity** — Could an engineer implement this without asking questions? Ambiguous language? +4. **Scope** — Does the document creep beyond the original problem? YAGNI violations? +5. **Feasibility** — Can this actually be built with the stated approach? Hidden complexity? + +The subagent should return: +- A quality score (1-10) +- PASS if no issues, or a numbered list of issues with dimension, description, and fix + +**Step 2: Fix and re-dispatch** + +If the reviewer returns issues: +1. Fix each issue in the document on disk (use Edit tool) +2. Re-dispatch the reviewer subagent with the updated document +3. Maximum 3 iterations total + +**Convergence guard:** If the reviewer returns the same issues on consecutive iterations +(the fix didn't resolve them or the reviewer disagrees with the fix), stop the loop +and persist those issues as "Reviewer Concerns" in the document rather than looping +further. + +If the subagent fails, times out, or is unavailable — skip the review loop entirely. +Tell the user: "Spec review unavailable — presenting unreviewed doc." The document is +already written to disk; the review is a quality bonus, not a gate. + +**Step 3: Report and persist metrics** + +After the loop completes (PASS, max iterations, or convergence guard): + +1. 
Tell the user the result — summary by default: + "Your doc survived N rounds of adversarial review. M issues caught and fixed. + Quality score: X/10." + If they ask "what did the reviewer find?", show the full reviewer output. + +2. If issues remain after max iterations or convergence, add a "## Reviewer Concerns" + section to the document listing each unresolved issue. Downstream skills will see this. + +3. Append metrics: +\`\`\`bash +mkdir -p ~/.gstack/analytics +echo '{"skill":"${_ctx.skillName}","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","iterations":ITERATIONS,"issues_found":FOUND,"issues_fixed":FIXED,"remaining":REMAINING,"quality_score":SCORE}' >> ~/.gstack/analytics/spec-review.jsonl 2>/dev/null || true +\`\`\` +Replace ITERATIONS, FOUND, FIXED, REMAINING, SCORE with actual values from the review.`; +} + +function generateBenefitsFrom(ctx: TemplateContext): string { + if (!ctx.benefitsFrom || ctx.benefitsFrom.length === 0) return ''; + + const skillList = ctx.benefitsFrom.map(s => `\`/${s}\``).join(' or '); + const first = ctx.benefitsFrom[0]; + + return `## Prerequisite Skill Offer + +When the design doc check above prints "No design doc found," offer the prerequisite +skill before proceeding. + +Say to the user via AskUserQuestion: + +> "No design doc found for this branch. ${skillList} produces a structured problem +> statement, premise challenge, and explored alternatives — it gives this review much +> sharper input to work with. Takes about 10 minutes. The design doc is per-feature, +> not per-product — it captures the thinking behind this specific change." + +Options: +- A) Run /${first} first (in another window, then come back) +- B) Skip — proceed with standard review + +If they skip: "No worries — standard review. If you ever want sharper input, try +/${first} first next time." Then proceed normally. 
Do not re-offer later in the session.`; +} + +function generateDesignSketch(_ctx: TemplateContext): string { + return `## Visual Sketch (UI ideas only) + +If the chosen approach involves user-facing UI (screens, pages, forms, dashboards, +or interactive elements), generate a rough wireframe to help the user visualize it. +If the idea is backend-only, infrastructure, or has no UI component — skip this +section silently. + +**Step 1: Gather design context** + +1. Check if \`DESIGN.md\` exists in the repo root. If it does, read it for design + system constraints (colors, typography, spacing, component patterns). Use these + constraints in the wireframe. +2. Apply core design principles: + - **Information hierarchy** — what does the user see first, second, third? + - **Interaction states** — loading, empty, error, success, partial + - **Edge case paranoia** — what if the name is 47 chars? Zero results? Network fails? + - **Subtraction default** — "as little design as possible" (Rams). Every element earns its pixels. + - **Design for trust** — every interface element builds or erodes user trust. + +**Step 2: Generate wireframe HTML** + +Generate a single-page HTML file with these constraints: +- **Intentionally rough aesthetic** — use system fonts, thin gray borders, no color, + hand-drawn-style elements. This is a sketch, not a polished mockup. +- Self-contained — no external dependencies, no CDN links, inline CSS only +- Show the core interaction flow (1-3 screens/states max) +- Include realistic placeholder content (not "Lorem ipsum" — use content that + matches the actual use case) +- Add HTML comments explaining design decisions + +Write to a temp file: +\`\`\`bash +SKETCH_FILE="/tmp/gstack-sketch-$(date +%s).html" +\`\`\` + +**Step 3: Render and capture** + +\`\`\`bash +$B goto "file://$SKETCH_FILE" +$B screenshot /tmp/gstack-sketch.png +\`\`\` + +If \`$B\` is not available (browse binary not set up), skip the render step. 
Tell the +user: "Visual sketch requires the browse binary. Run the setup script to enable it." + +**Step 4: Present and iterate** + +Show the screenshot to the user. Ask: "Does this feel right? Want to iterate on the layout?" + +If they want changes, regenerate the HTML with their feedback and re-render. +If they approve or say "good enough," proceed. + +**Step 5: Include in design doc** + +Reference the wireframe screenshot in the design doc's "Recommended Approach" section. +The screenshot file at \`/tmp/gstack-sketch.png\` can be referenced by downstream skills +(\`/plan-design-review\`, \`/design-review\`) to see what was originally envisioned.`; +} + const RESOLVERS: Record string> = { COMMAND_REFERENCE: generateCommandReference, SNAPSHOT_FLAGS: generateSnapshotFlags, @@ -1272,6 +1423,9 @@ const RESOLVERS: Record string> = { DESIGN_REVIEW_LITE: generateDesignReviewLite, REVIEW_DASHBOARD: generateReviewDashboard, TEST_BOOTSTRAP: generateTestBootstrap, + SPEC_REVIEW_LOOP: generateSpecReviewLoop, + DESIGN_SKETCH: generateDesignSketch, + BENEFITS_FROM: generateBenefitsFrom, }; // ─── Codex Helpers ─────────────────────────────────────────── @@ -1394,7 +1548,14 @@ function processTemplate(tmplPath: string, host: Host = 'claude'): { outputPath: // Extract skill name from frontmatter for TemplateContext const nameMatch = tmplContent.match(/^name:\s*(.+)$/m); const skillName = nameMatch ? nameMatch[1].trim() : path.basename(path.dirname(tmplPath)); - const ctx: TemplateContext = { skillName, tmplPath, host, paths: HOST_PATHS[host] }; + + // Extract benefits-from list from frontmatter (inline YAML: benefits-from: [a, b]) + const benefitsMatch = tmplContent.match(/^benefits-from:\s*\[([^\]]*)\]/m); + const benefitsFrom = benefitsMatch + ? 
benefitsMatch[1].split(',').map(s => s.trim()).filter(Boolean) + : undefined; + + const ctx: TemplateContext = { skillName, tmplPath, benefitsFrom, host, paths: HOST_PATHS[host] }; // Replace placeholders let content = tmplContent.replace(/\{\{(\w+)\}\}/g, (match, name) => { diff --git a/test/gen-skill-docs.test.ts b/test/gen-skill-docs.test.ts index 49714f2a..68d84465 100644 --- a/test/gen-skill-docs.test.ts +++ b/test/gen-skill-docs.test.ts @@ -416,6 +416,98 @@ describe('REVIEW_DASHBOARD resolver', () => { }); }); +// --- {{SPEC_REVIEW_LOOP}} resolver tests --- + +describe('SPEC_REVIEW_LOOP resolver', () => { + const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8'); + + test('contains all 5 review dimensions', () => { + for (const dim of ['Completeness', 'Consistency', 'Clarity', 'Scope', 'Feasibility']) { + expect(content).toContain(dim); + } + }); + + test('references Agent tool for subagent dispatch', () => { + expect(content).toMatch(/Agent.*tool/i); + }); + + test('specifies max 3 iterations', () => { + expect(content).toMatch(/3.*iteration|maximum.*3/i); + }); + + test('includes quality score', () => { + expect(content).toContain('quality score'); + }); + + test('includes metrics path', () => { + expect(content).toContain('spec-review.jsonl'); + }); + + test('includes convergence guard', () => { + expect(content).toMatch(/[Cc]onvergence/); + }); + + test('includes graceful failure handling', () => { + expect(content).toMatch(/skip.*review|unavailable/i); + }); +}); + +// --- {{DESIGN_SKETCH}} resolver tests --- + +describe('DESIGN_SKETCH resolver', () => { + const content = fs.readFileSync(path.join(ROOT, 'office-hours', 'SKILL.md'), 'utf-8'); + + test('references DESIGN.md for design system constraints', () => { + expect(content).toContain('DESIGN.md'); + }); + + test('contains wireframe or sketch terminology', () => { + expect(content).toMatch(/wireframe|sketch/i); + }); + + test('references browse binary for rendering', () 
=> { + expect(content).toContain('$B goto'); + }); + + test('references screenshot capture', () => { + expect(content).toContain('$B screenshot'); + }); + + test('specifies rough aesthetic', () => { + expect(content).toMatch(/[Rr]ough|hand-drawn/); + }); + + test('includes skip conditions', () => { + expect(content).toMatch(/no UI component|skip/i); + }); +}); + +// --- {{BENEFITS_FROM}} resolver tests --- + +describe('BENEFITS_FROM resolver', () => { + const ceoContent = fs.readFileSync(path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), 'utf-8'); + const engContent = fs.readFileSync(path.join(ROOT, 'plan-eng-review', 'SKILL.md'), 'utf-8'); + + test('plan-ceo-review contains prerequisite skill offer', () => { + expect(ceoContent).toContain('Prerequisite Skill Offer'); + expect(ceoContent).toContain('/office-hours'); + }); + + test('plan-eng-review contains prerequisite skill offer', () => { + expect(engContent).toContain('Prerequisite Skill Offer'); + expect(engContent).toContain('/office-hours'); + }); + + test('offer includes graceful decline', () => { + expect(ceoContent).toContain('No worries'); + }); + + test('skills without benefits-from do NOT have prerequisite offer', () => { + const qaContent = fs.readFileSync(path.join(ROOT, 'qa', 'SKILL.md'), 'utf-8'); + expect(qaContent).not.toContain('Prerequisite Skill Offer'); + }); +}); + // ─── Codex Generation Tests ───────────────────────────────── describe('Codex generation (--host codex)', () => { diff --git a/test/helpers/touchfiles.ts b/test/helpers/touchfiles.ts index 53cc709c..c516a3b5 100644 --- a/test/helpers/touchfiles.ts +++ b/test/helpers/touchfiles.ts @@ -57,9 +57,13 @@ export const E2E_TOUCHFILES: Record = { 'review-base-branch': ['review/**'], 'review-design-lite': ['review/**', 'test/fixtures/review-eval-design-slop.*'], + // Office Hours + 'office-hours-spec-review': ['office-hours/**', 'scripts/gen-skill-docs.ts'], + // Plan reviews 'plan-ceo-review': ['plan-ceo-review/**'], 
'plan-ceo-review-selective': ['plan-ceo-review/**'], + 'plan-ceo-review-benefits': ['plan-ceo-review/**', 'scripts/gen-skill-docs.ts'], 'plan-eng-review': ['plan-eng-review/**'], 'plan-eng-review-artifact': ['plan-eng-review/**'], @@ -140,6 +144,10 @@ export const LLM_JUDGE_TOUCHFILES: Record = { 'design-review/SKILL.md fix loop': ['design-review/SKILL.md', 'design-review/SKILL.md.tmpl'], 'design-consultation/SKILL.md research': ['design-consultation/SKILL.md', 'design-consultation/SKILL.md.tmpl'], + // Office Hours + 'office-hours/SKILL.md spec review': ['office-hours/SKILL.md', 'office-hours/SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'], + 'office-hours/SKILL.md design sketch': ['office-hours/SKILL.md', 'office-hours/SKILL.md.tmpl', 'scripts/gen-skill-docs.ts'], + // Other skills 'retro/SKILL.md instructions': ['retro/SKILL.md', 'retro/SKILL.md.tmpl'], 'qa-only/SKILL.md workflow': ['qa-only/SKILL.md', 'qa-only/SKILL.md.tmpl'], diff --git a/test/skill-e2e.test.ts b/test/skill-e2e.test.ts index 96019f70..0b6331f3 100644 --- a/test/skill-e2e.test.ts +++ b/test/skill-e2e.test.ts @@ -2911,6 +2911,128 @@ Write the full output (including the GATE verdict) to ${codexDir}/codex-output.m }, 360_000); }); +// --- Office Hours Spec Review E2E --- + +describeIfSelected('Office Hours Spec Review E2E', ['office-hours-spec-review'], () => { + let ohDir: string; + + beforeAll(() => { + ohDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-oh-spec-')); + const run = (cmd: string, args: string[]) => + spawnSync(cmd, args, { cwd: ohDir, stdio: 'pipe', timeout: 5000 }); + + run('git', ['init', '-b', 'main']); + run('git', ['config', 'user.email', 'test@test.com']); + run('git', ['config', 'user.name', 'Test']); + fs.writeFileSync(path.join(ohDir, 'README.md'), '# Test Project\n'); + run('git', ['add', '.']); + run('git', ['commit', '-m', 'init']); + + // Copy office-hours skill + fs.mkdirSync(path.join(ohDir, 'office-hours'), { recursive: true }); + fs.copyFileSync( + 
path.join(ROOT, 'office-hours', 'SKILL.md'), + path.join(ohDir, 'office-hours', 'SKILL.md'), + ); + }); + + afterAll(() => { + try { fs.rmSync(ohDir, { recursive: true, force: true }); } catch {} + }); + + test('/office-hours SKILL.md contains spec review loop', async () => { + const result = await runSkillTest({ + prompt: `Read office-hours/SKILL.md. I want to understand the spec review loop. + +Summarize what the "Spec Review Loop" section does — specifically: +1. How many dimensions does the reviewer check? +2. What tool is used to dispatch the reviewer? +3. What's the maximum number of iterations? +4. What metrics are tracked? + +Write your summary to ${ohDir}/spec-review-summary.md`, + workingDirectory: ohDir, + maxTurns: 8, + timeout: 120_000, + testName: 'office-hours-spec-review', + runId, + }); + + logCost('/office-hours spec review', result); + recordE2E('/office-hours-spec-review', 'Office Hours Spec Review E2E', result); + expect(result.exitReason).toBe('success'); + + const summaryPath = path.join(ohDir, 'spec-review-summary.md'); + if (fs.existsSync(summaryPath)) { + const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase(); + // Verify the agent understood the key concepts + expect(summary).toMatch(/5.*dimension|dimension.*5|completeness|consistency|clarity|scope|feasibility/); + expect(summary).toMatch(/agent|subagent/); + expect(summary).toMatch(/3.*iteration|iteration.*3|maximum.*3/); + } + }, 180_000); +}); + +// --- Plan CEO Review Benefits-From E2E --- + +describeIfSelected('Plan CEO Review Benefits-From E2E', ['plan-ceo-review-benefits'], () => { + let benefitsDir: string; + + beforeAll(() => { + benefitsDir = fs.mkdtempSync(path.join(os.tmpdir(), 'skill-e2e-benefits-')); + const run = (cmd: string, args: string[]) => + spawnSync(cmd, args, { cwd: benefitsDir, stdio: 'pipe', timeout: 5000 }); + + run('git', ['init', '-b', 'main']); + run('git', ['config', 'user.email', 'test@test.com']); + run('git', ['config', 'user.name', 'Test']); 
+ fs.writeFileSync(path.join(benefitsDir, 'README.md'), '# Test Project\n'); + run('git', ['add', '.']); + run('git', ['commit', '-m', 'init']); + + // Copy plan-ceo-review skill + fs.mkdirSync(path.join(benefitsDir, 'plan-ceo-review'), { recursive: true }); + fs.copyFileSync( + path.join(ROOT, 'plan-ceo-review', 'SKILL.md'), + path.join(benefitsDir, 'plan-ceo-review', 'SKILL.md'), + ); + }); + + afterAll(() => { + try { fs.rmSync(benefitsDir, { recursive: true, force: true }); } catch {} + }); + + test('/plan-ceo-review SKILL.md contains prerequisite skill offer', async () => { + const result = await runSkillTest({ + prompt: `Read plan-ceo-review/SKILL.md. Search for sections about "Prerequisite" or "office-hours" or "design doc found". + +Summarize what happens when no design doc is found — specifically: +1. Is /office-hours offered as a prerequisite? +2. What options does the user get? +3. Is there a mid-session detection for when the user seems lost? + +Write your summary to ${benefitsDir}/benefits-summary.md`, + workingDirectory: benefitsDir, + maxTurns: 8, + timeout: 120_000, + testName: 'plan-ceo-review-benefits', + runId, + }); + + logCost('/plan-ceo-review benefits-from', result); + recordE2E('/plan-ceo-review-benefits', 'Plan CEO Review Benefits-From E2E', result); + expect(result.exitReason).toBe('success'); + + const summaryPath = path.join(benefitsDir, 'benefits-summary.md'); + if (fs.existsSync(summaryPath)) { + const summary = fs.readFileSync(summaryPath, 'utf-8').toLowerCase(); + // Verify the agent understood the skill chaining + expect(summary).toMatch(/office.hours/); + expect(summary).toMatch(/design doc|no design/i); + } + }, 180_000); +}); + // Module-level afterAll — finalize eval collector after all tests complete afterAll(async () => { if (evalCollector) { diff --git a/test/skill-validation.test.ts b/test/skill-validation.test.ts index ea683762..f4405a25 100644 --- a/test/skill-validation.test.ts +++ b/test/skill-validation.test.ts @@ 
-644,6 +644,59 @@ describe('office-hours skill structure', () => { test('contains builder operating principles', () => { expect(content).toContain('Delight is the currency'); }); + + // Spec Review Loop (Phase 5.5) + test('contains spec review loop', () => { + expect(content).toContain('Spec Review Loop'); + }); + + test('contains adversarial review dimensions', () => { + for (const dim of ['Completeness', 'Consistency', 'Clarity', 'Scope', 'Feasibility']) { + expect(content).toContain(dim); + } + }); + + test('contains subagent dispatch instruction', () => { + expect(content).toMatch(/Agent.*tool|subagent/i); + }); + + test('contains max 3 iterations', () => { + expect(content).toMatch(/3.*iteration|maximum.*3/i); + }); + + test('contains quality score', () => { + expect(content).toContain('quality score'); + }); + + test('contains spec review metrics path', () => { + expect(content).toContain('spec-review.jsonl'); + }); + + test('contains convergence guard', () => { + expect(content).toMatch(/convergence/i); + }); + + // Visual Sketch (Phase 4.5) + test('contains visual sketch section', () => { + expect(content).toContain('Visual Sketch'); + }); + + test('contains wireframe generation', () => { + expect(content).toMatch(/wireframe|sketch/i); + }); + + test('contains DESIGN.md awareness', () => { + expect(content).toContain('DESIGN.md'); + }); + + test('contains browse rendering', () => { + expect(content).toContain('$B goto'); + expect(content).toContain('$B screenshot'); + }); + + test('contains rough aesthetic instruction', () => { + expect(content).toMatch(/rough|hand-drawn/i); + }); }); describe('investigate skill structure', () => { @@ -856,6 +909,22 @@ describe('CEO review mode validation', () => { expect(content).toContain('HOLD SCOPE'); expect(content).toContain('REDUCTION'); }); + + // Skill chaining (benefits-from) + test('contains prerequisite skill offer for office-hours', () => { + expect(content).toContain('Prerequisite Skill Offer'); + 
expect(content).toContain('/office-hours'); + }); + + test('contains mid-session detection', () => { + expect(content).toContain('Mid-session detection'); + expect(content).toMatch(/still figuring out|seems lost/i); + }); + + // Spec review on CEO plans + test('contains spec review loop for CEO plan documents', () => { + expect(content).toContain('Spec Review Loop'); + }); }); // --- gstack-slug helper ---