feat: /codebase-audit — full pipeline from diagnosis to fix plan to review by boinger · Pull Request #266 · garrytan/gstack

boinger · 2026-03-21T04:16:49Z

Audit → fix plan → review → implement. One command.

/codebase-audit reads a codebase cold, produces a structured report (health score, findings by severity, architecture diagram), writes a fix plan, and chains into /plan-eng-review for substantive items. Full pipeline in one session.

Where /review checks a diff, this checks everything — and goes further by splitting findings into mechanical fixes (apply now) and substantive fixes (review first).

Run it against any codebase — one you inherited, one you wrote last month, or one you just cloned.

Tested against 4 production codebases across 4 languages:

Project	Language	Architecture	Score	Key Finding
confvis	Go	CLI tool	77/100	Missing error wrapping in 12 public functions
ankermake-m5-protocol	Python/Flask	IoT web server	62/100	WebSocket endpoints bypass API key auth
unifi-network-mcp	Python	MCP server	62/100	Shallow-copy bug in 8 mutation methods (same class as a recently fixed bug — "fix one, miss eight")
sauna-controller-esp32	C++/Swift	Embedded firmware + iOS app	52/100	IEEE 754 NaN bypasses all safety checks on a 7kW heater relay
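The NaN finding in the last row is worth a concrete illustration. The sketch below is not the audited firmware (which is C++); it is a minimal Python analogue with a hypothetical cutoff value and hypothetical function names, showing why an ordered comparison alone cannot catch a NaN reading and how a finiteness check fails closed.

```python
import math

MAX_SAFE_TEMP_C = 110.0  # hypothetical cutoff, not taken from the audited project


def relay_allowed_naive(temp_c: float) -> bool:
    """Unsafe: every ordered comparison with NaN is False, so NaN never trips the cutoff."""
    if temp_c > MAX_SAFE_TEMP_C:
        return False
    return True


def relay_allowed_guarded(temp_c: float) -> bool:
    """Fail closed: reject non-finite readings before doing any range check."""
    if not math.isfinite(temp_c):
        return False
    return temp_c <= MAX_SAFE_TEMP_C


reading = float("nan")  # e.g. what a faulty or disconnected sensor might report
print(relay_allowed_naive(reading))    # True
print(relay_allowed_guarded(reading))  # False
```

The same property holds in any IEEE 754 language, which is why a range check written as "reject if above the limit" silently passes NaN.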

Each test ran the full pipeline: audit → AskUserQuestion → user selects /plan-eng-review → eng review runs in the same session. The confvis run went further: mechanical fixes applied, lint clean, all 29 test packages pass.
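For readers unfamiliar with the shallow-copy bug class named in the unifi-network-mcp row, here is a generic Python illustration. The data shape and function names are hypothetical and are not taken from that project; the point is only that `dict.copy()` duplicates the top level and shares everything nested.

```python
import copy


def rename_port_shallow(device: dict, name: str) -> dict:
    updated = device.copy()          # copies only the top-level dict
    updated["port"]["name"] = name   # nested dict is still shared with the caller's object
    return updated


def rename_port_deep(device: dict, name: str) -> dict:
    updated = copy.deepcopy(device)  # nested structures are duplicated too
    updated["port"]["name"] = name
    return updated


original = {"id": "sw1", "port": {"name": "uplink"}}
rename_port_shallow(original, "lan")
print(original["port"]["name"])  # lan (the caller's data was mutated)

original = {"id": "sw1", "port": {"name": "uplink"}}
rename_port_deep(original, "lan")
print(original["port"]["name"])  # uplink
```

Because the pattern is easy to repeat across similar methods, fixing one call site tends to leave siblings behind, which matches the "fix one, miss eight" description.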

Design decisions

Plan mode native. The audit is "planning-for-a-plan." Phases 1-3 are read-only research. Phase 4 writes the report via Bash heredoc (bypasses plan mode's Write restriction) and produces the fix plan as the plan file. No fighting the tool — the plan file IS the natural output.

Review chaining. After the plan is written, AskUserQuestion offers /plan-eng-review, /plan-ceo-review, accept as-is, or edit first. The Skill tool is invoked immediately on selection — before plan mode's "Ready to code?" can intercept. Mechanical fixes (Part 1) can be applied directly; substantive fixes (Part 2) route through the review pipeline.

Health score calibration. 100-point scale: critical=-25, important=-10, notable=-3, opportunity=0. Original weights (-25/-15/-5) scored a well-built Go CLI at 30/100. Recalibrated so the same codebase scores 77/100 — matches the intuition of "solid code with some gaps."
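To make the effect of the recalibration concrete, here is a minimal sketch of a deduction-based score. The weights come from the paragraph above; everything else is an assumption: the PR does not say how deductions are aggregated, so summing them and clamping at 0 is a guess, the opportunity weight under the original scheme is assumed to be 0, and the finding list is made up. It should not be expected to reproduce the scores reported in this PR.

```python
from collections import Counter

# Weights as stated in the PR; "opportunity" in ORIGINAL is an assumption.
ORIGINAL = {"critical": 25, "important": 15, "notable": 5, "opportunity": 0}
RECALIBRATED = {"critical": 25, "important": 10, "notable": 3, "opportunity": 0}


def health_score(severities, weights):
    """Assumed aggregation: subtract the summed deductions from 100, floor at 0."""
    counts = Counter(severities)
    deduction = sum(weights[sev] * n for sev, n in counts.items())
    return max(0, 100 - deduction)


# Hypothetical finding list, for illustration only.
findings = ["important", "important", "notable", "notable", "notable", "opportunity"]
print(health_score(findings, ORIGINAL))      # 55
print(health_score(findings, RECALIBRATED))  # 71
```

Under this assumed model, the recalibration mostly changes how fast mid-severity findings drag the score down, while a single critical finding costs the same 25 points in both schemes.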

Grep discipline. Phase 3 uses files_with_matches mode exclusively. Content-mode grep on broad patterns matched entire files and flooded context. Rule 14 bans content-mode during checklist execution.
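The difference between the two Grep modes can be sketched outside the tool. The function below is a hypothetical Python stand-in, not the Grep tool itself: it answers "which files match?" and never returns matched text, so even a pattern that matches an entire file costs one path in the output.

```python
import re
import tempfile
from pathlib import Path


def files_with_matches(root: Path, pattern: str, glob: str = "*") -> list[str]:
    """Return relative paths of files containing a match, never the matched content."""
    regex = re.compile(pattern, re.MULTILINE)
    hits = []
    for path in sorted(root.rglob(glob)):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip rather than abort the scan
        if regex.search(text):
            hits.append(path.relative_to(root).as_posix())
    return hits


# Tiny demo tree; file names and contents are made up.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.go").write_text("for {\n    defer f.Close()\n}\n")
    (root / "b.go").write_text("package main\n")
    print(files_with_matches(root, r"defer\s", "*.go"))  # ['a.go']
```

A follow-up read of the specific files that matched keeps the context cost proportional to the number of hits rather than to the size of the matched spans.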

Knowledge cutoff awareness. Rule 11 prevents false positives from flagging valid-but-unfamiliar versions (e.g., Go 1.25) as nonexistent.

What's included

codebase-audit/SKILL.md.tmpl — Template with 4 phases, 14 rules, plan mode integration, review chaining
codebase-audit/checklist.md — 70-item audit checklist across 7 categories with [QUICK] tags
codebase-audit/report-template.md — Structured report format
codebase-audit/references/patterns.md — Language-specific anti-patterns (JS/TS, Python, Ruby, Go, Rust, Swift, PHP)
.agents/skills/gstack-codebase-audit/SKILL.md — Same skill content, paths adapted for OpenAI Codex CLI's .agents/skills/ discovery convention
Tests: structural validation, LLM-judge eval, E2E eval for --quick mode, touchfile entry
docs/skills.md deep dive, README.md updates, CHANGELOG.md v0.9.5.0

Competitive landscape

No open PR on this repo covers this scope. Closest:

feat: shared test coverage audit across plan/ship/review (v0.9.5.0) #259 (test-coverage-catalog) — tests only, one of our 7 categories
feat: add /cso skill — OWASP Top 10 + STRIDE security audit #155 (/cso) — security only, no health scoring, no baseline tracking, no review chaining
feat: add /hyper-plan skill — recursive codebase improvement with convergence scoring #166 (/hyper-plan) — iterative optimizer that loops fix→verify→rescore; complementary, not competing

Follow-up scope (tracked in fork)

Deferred to a follow-up PR to keep this one focused on the core experience:

Focused mode flags (--security-only, --tests-only, etc.) — run only matching checklist categories
CI mode (--ci --min-score N) — non-interactive, JSON-only output, exit-code semantics for quality gates
HTML report format — collapsible sections, syntax highlighting, visual score indicator
Auto-fix suggestions — include code diffs alongside findings (not applied, just suggested)
Cross-repo comparison — aggregate baseline.json across multiple projects for fleet-level health view

Test plan

bun test — all free tests pass
bun run gen:skill-docs --dry-run — all FRESH (Claude + Codex hosts)
Tested against 4 codebases: Go CLI, Python/Flask, Python MCP, C++/Swift embedded
Review chaining works end-to-end (audit → AskUserQuestion → /plan-eng-review invoked)
Regression mode works (sauna controller had a previous baseline — delta shown)
Plan mode compatible (report via Bash heredoc, plan as plan file)
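For the regression-mode item above, a delta against a previous baseline can be sketched as a set comparison over finding IDs. The baseline shape used here (a `score` number and a `findings` list of IDs) is an assumption for illustration; the PR does not document the actual baseline.json schema, and the values below are made up.

```python
def baseline_delta(previous: dict, current: dict) -> dict:
    """Compare two audit baselines by finding ID (assumed schema: score + findings)."""
    prev_ids = set(previous["findings"])
    curr_ids = set(current["findings"])
    return {
        "score_change": current["score"] - previous["score"],
        "fixed": sorted(prev_ids - curr_ids),       # present before, gone now
        "new": sorted(curr_ids - prev_ids),         # introduced since the baseline
        "persisting": sorted(prev_ids & curr_ids),  # still open
    }


# Hypothetical baselines, not data from any audited project.
previous = {"score": 50, "findings": ["aaa", "bbb", "ccc"]}
current = {"score": 58, "findings": ["bbb", "ccc", "ddd"]}
print(baseline_delta(previous, current))
# {'score_change': 8, 'fixed': ['aaa'], 'new': ['ddd'], 'persisting': ['bbb', 'ccc']}
```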

🤖 Generated with Claude Code

Add three sections for contributor onboarding: template placeholder reference (all 10 resolvers), browse subsystem architecture (daemon model, state file, command dispatch, ref system, logging), and test infrastructure internals (touchfiles, session-runner, eval-store, llm-judge, observability). Add standard header.

Syncs boinger/gstack with garrytan/gstack upstream. Includes new skills (freeze, careful, guard, investigate, office-hours, codex, unfreeze), Gemini CLI support, telemetry infrastructure, Node-compatible server bundle, and Codex CLI e2e tests.

- Add to ALL_SKILLS in gen-skill-docs.test.ts
- Add to all 4 preamble validation arrays in skill-validation.test.ts
- Add structural validation tests: checklist categories, [QUICK] tags, report template sections, language patterns, phase markers
- Add LLM-judge eval for SKILL.md quality scoring
- Add E2E eval for --quick mode smoke test
- Add touchfile dependencies for diff-based test selection

- README: skills table, count (thirteen → fourteen), install prompts, add-to-project prompt, troubleshooting snippet
- docs/skills.md: table entry + deep dive section with philosophy, modes, health scoring, when-to-use comparison table, and example
- CHANGELOG: v0.6.5.0 entry

After the audit report, the skill now offers four options:

- Show all findings inline
- Fix selected findings (pick by number, creates atomic commits)
- Quick fixes only (mechanical fixes like .gitignore, exception narrowing, missing timeouts; no judgment calls)
- Done (review report later)

This bridges the gap between "here's what's wrong" and "let me fix it" without requiring a separate session. The audit phase remains read-only; fix mode lifts the constraint after user opt-in.

The "Fix selected findings" path now distinguishes between mechanical fixes (apply directly) and substantive fixes (recommend /plan-eng-review and optionally /plan-ceo-review before executing). This follows the gstack philosophy of using the review pipeline for quality assurance on anything beyond trivial changes.

…sity

Two issues from live testing:

1. Substantive fixes now require an explicit AskUserQuestion offering /plan-eng-review before implementation. The previous "recommend" language was treated as optional; it is now a mandatory step with A/B/C options.
2. Checklist pattern execution now uses files_with_matches mode instead of content mode for Grep. This prevents broad regex matches from dumping 3000+ lines into the conversation.

Plan mode restricts Write to the plan file path. The audit report needs to write to ~/.gstack/. Added Phase 4.0 requiring ExitPlanMode before any file writes, with fallback messaging if it cannot exit.

…tern

Replaces the ambiguous multi-choice next-steps prompt with the same deterministic review chaining pattern used by /plan-ceo-review and /plan-eng-review. Key changes:

- Findings are pre-classified as mechanical vs substantive
- Default recommendation routes substantive fixes through /plan-eng-review
- Mechanical fixes are applied immediately with atomic commits
- AskUserQuestion presents concrete options based on finding classification
- Section renamed to 'Next Steps - Review Chaining' matching other skills

Two fixes from plan-mode testing:

1. Report writing now uses a Bash heredoc instead of the Write tool. Plan mode restricts Write to plan files, but Bash is unrestricted, so this sidesteps the plan mode issue entirely.
2. Added absolute rule 14 banning content-mode Grep during checklist execution. Fixed the defer-in-loops regex in patterns.md, a multiline pattern that matched entire files (3000+ lines).

The plan file banner now recommends the appropriate review path:

- Scope/product decisions → /plan-ceo-review first, then /plan-eng-review
- Implementation-level fixes → /plan-eng-review directly

This matches how other gstack skills chain reviews based on the nature of the work.

…ve work

The key architectural insight: plan mode owns the session after the plan file is written. So mechanical fixes must be applied BEFORE writing the plan. Then 'Ready to code?' only covers substantive fixes, and the banner correctly says to run /plan-eng-review first. Previously both mechanical and substantive fixes were in the plan, making 'Ready to code?' ambiguous.

Fundamental reframe: the audit IS planning. Phases 1-3 are read-only research (compatible with plan mode). Phase 4 produces two outputs: the archival report (to ~/.gstack/ via Bash) and the fix plan (to the plan file). 'Ready to code?' means 'execute this fix plan.'

The plan file now has two parts:

- Part 1: Mechanical fixes (apply immediately on execution)
- Part 2: Substantive fixes (banner recommends /plan-eng-review first)

Removes all plan-mode-fighting instructions. No more ExitPlanMode, no more 'do NOT write to plan files.' The plan file is the correct output, which is how other gstack skills work.

After the fix plan is written, the skill now presents an AskUserQuestion offering to run /plan-eng-review or /plan-ceo-review directly. The user can say yes, make changes, or accept as-is. This replaces the previous approach of ending with 'Ready to code?' which required the user to manually type /plan-eng-review. Now the skill escorts the user to the next step, matching how /plan-ceo-review chains to /plan-eng-review.

Regenerate .agents/skills/ for both codebase-audit (template updates) and retro (upstream v0.9.4.1 fix propagation).

- Swap rules 13/14 in template (14 appeared before 13)
- Remove "project you've never seen before" framing: the audit works on any codebase regardless of familiarity
- Lead CHANGELOG with fix pipeline, not just report generation

The first real-world run of /codebase-audit against gstack surfaced bugs in the skill template itself:

- Step 1.2 ls -la exits non-zero when build files are missing, which cascades and cancels all parallel tool calls. Add || true.
- Step 4.1 uses cat via Bash to read the report template, contradicting Key Rule 12 (always use the Read tool). Replace with Read instruction.
- Step 1.3 LOC count includes non-code files (images, JSON, lockfiles), inflating the count and triggering incorrect Large codebase scoping. Filter by common source code extensions.
- Baseline finding IDs were specified as SHA256 hashes but no mechanism was provided to compute them, making regression comparison impossible. Add shasum snippet.
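The last fix is about making finding IDs computable. The commit adds a shasum snippet that is not shown here; the sketch below is a Python equivalent under stated assumptions. Which fields feed the hash (category, file path, title) and the normalization applied are guesses made for illustration, not the skill's actual definition.

```python
import hashlib


def finding_id(category: str, file_path: str, title: str) -> str:
    """Stable SHA-256 ID for a finding (the field choice is an assumption).

    Line numbers are deliberately left out so that unrelated edits that
    shift a finding up or down the file do not change its ID.
    """
    key = "\n".join(part.strip() for part in (category, file_path, title))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Hypothetical finding, for illustration only.
fid = finding_id("security", "src/server.py", "WebSocket endpoint skips auth")
print(len(fid))  # 64
print(fid == finding_id("security", "src/server.py", "WebSocket endpoint skips auth"))  # True
```

Whatever fields are chosen, the ID has to be derived only from data that stays the same between runs, otherwise every finding looks new on each audit.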

boinger · 2026-03-21T05:40:56Z

/codebase-audit — first run against gstack main

Ran /codebase-audit against gstack main (1f4b6fd) — 67/100, 0 critical, 7 important findings.

#	Finding	Severity	Location
1	Missing path validation on `upload` command	Important	`write-commands.ts:238`
2	Telemetry ingest has no authentication	Important	`supabase/functions/telemetry-ingest/`
3	Missing rate limiting on Supabase endpoints	Important	`supabase/functions/`
4	No test coverage for Supabase edge functions	Important	`supabase/functions/`
5	gen-skill-docs.ts is a 1,785-line god module	Important	`scripts/gen-skill-docs.ts`
6	Page event listeners not cleaned up on tab close	Important	`browser-manager.ts:555`
7	Hardcoded stale fallback version in update-check	Important	`supabase/functions/update-check/`

The strongest cluster is the Supabase edge functions: three related findings (no auth, no rate limiting, no tests) in one subsystem. The telemetry-ingest edge function has no request-level authentication and executes with service role privileges (bypassing RLS). Low-severity given the data is non-sensitive telemetry, but worth hardening. The update-check function has a stale fallback version (0.6.4.1 vs current 0.9.4.1). Zero test files cover any of the three deployed Deno functions.

The browse engine scored well: strong URL validation, path traversal protection, auth tokens, ~150 integration tests. One gap: upload is the only file-touching command that skips validateReadPath().
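As a rough picture of what a read-path check buys, here is a hypothetical Python analogue. It is not gstack's validateReadPath() (a TypeScript function whose implementation is not shown in this PR); the function name, signature, and allowed-roots policy are all assumptions.

```python
import tempfile
from pathlib import Path


def validate_read_path(candidate, allowed_roots):
    """Resolve symlinks and '..' first, then require the result to sit under an allowed root."""
    resolved = Path(candidate).resolve()
    for root in allowed_roots:
        if resolved.is_relative_to(Path(root).resolve()):
            return resolved
    raise ValueError(f"path outside allowed roots: {candidate}")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    print(validate_read_path(root / "upload.png", [root]).name)  # upload.png
    try:
        validate_read_path(root / ".." / "etc" / "passwd", [root])
    except ValueError:
        print("rejected")  # prints "rejected"
```

A command that opens a user-supplied path without a check of this kind is the gap the finding describes.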

boinger added 20 commits March 18, 2026 08:30

Merge branch 'garrytan:main' into main

8f6cfb1

merge upstream/main into fork

4115eb9

Merge remote-tracking branch 'upstream/main'

1908fe0

fix: explicit plan mode exit before report generation

4b0909f

chore: regenerate Codex SKILL.md files after rebase

37b249a

fix: rule numbering and changelog language

a8f4a14

fix: force Skill tool invocation before plan mode takes over

6826f7e

boinger wants to merge 20 commits into garrytan:main from boinger:feat/codebase-audit