diff --git a/templates/references/self-improvement.md b/templates/references/self-improvement.md
index ed06262e..e7f8c63d 100644
--- a/templates/references/self-improvement.md
+++ b/templates/references/self-improvement.md
@@ -43,6 +43,10 @@ at the end of every session in a MaxsimCLI project. Each entry records:
 - Which tasks were attempted repeatedly without a commit (likely failed).
 - Long-term trends across many sessions.
 
+### MEMORY.md Size Limit
+
+Keep MEMORY.md under 200 lines (the Claude Code context loading limit). The `maxsim-capture-learnings` Stop hook enforces this by pruning at 180 lines, leaving headroom. When the file approaches the limit, the oldest entries are removed first. Each entry should be concise (3-5 lines) to maximize the number of sessions that fit.
+
 ---
 
 ## 3. Results Tracking
diff --git a/templates/skills/autoresearch/references/loop-protocol.md b/templates/skills/autoresearch/references/loop-protocol.md
index 3ed6dbe4..b090036c 100644
--- a/templates/skills/autoresearch/references/loop-protocol.md
+++ b/templates/skills/autoresearch/references/loop-protocol.md
@@ -93,10 +93,11 @@ If verification exceeds 2x normal time, kill and treat as crash.
 
 Some metrics are inherently noisy (benchmark times, ML accuracy). Strategies:
 
-- **Multi-run verification:** Run verify N times, use the median.
-- **Minimum improvement threshold:** Ignore improvements smaller than the noise floor.
-- **Confirmation run:** Re-verify before making a final keep decision.
-- **Environment pinning:** Pin random seeds, use deterministic test ordering, flush caches.
+- **For improvements of 1–5%:** Run the verify command 3 times and use the median result.
+- **For improvements >5%:** Run the verify command 5 times and use the median result.
+- **Minimum improvement threshold:** Ignore improvements smaller than the noise floor (typically 0.5% for benchmarks).
+- **Confirmation run:** After accepting an improvement, re-verify once more before making the final keep decision.
+- **Environment pinning:** Pin random seeds, use deterministic test ordering, flush caches between runs.
 
 ## Phase 5.5: Guard (Regression Check)
 
diff --git a/templates/skills/verification/SKILL.md b/templates/skills/verification/SKILL.md
index 5db7fc6f..a55096b1 100644
--- a/templates/skills/verification/SKILL.md
+++ b/templates/skills/verification/SKILL.md
@@ -167,3 +167,42 @@ Do not attempt a 4th run without user acknowledgment and revised instructions.
 | Skipping Gate 4 after Gate 3 passes | Declaring done without regression check | Gate 3 and Gate 4 are both required; neither is optional |
 | Conflating "no errors" with "correct output" | Exit code 0 but wrong behavior | Evidence must show correct output, not just absence of error |
 | Writing evidence after the fact | Constructing output from memory | Run the command, capture the output, paste it verbatim |
+
+---
+
+## 5-Step Verification Process
+
+When verification fails, follow this structured process:
+
+1. **Run the check command one final time** — capture fresh output as evidence
+2. **Construct diagnostic summary** — compare spec expectations vs actual output
+3. **Identify root cause** — is it a spec problem, environment problem, or implementation problem?
+4. **Propose next step** — rewrite spec, fix environment, reduce scope, or escalate
+5. **Escalate if unresolved** — create a diagnostic GitHub Issue with all evidence
+
+---
+
+## GitHub Issue Escalation
+
+When a task fails verification after 3 attempts, escalate by creating (or commenting on) a GitHub Issue:
+
+1. **Original task spec** — quoted from the plan comment
+2. **What was attempted** — brief factual summary of each attempt
+3. **The specific gate that failed** — exact error output from each run
+4. **Root cause analysis** — spec/environment/implementation classification
+5. **Proposed next step** — rewrite spec, fix environment, reduce scope, or request user input
+
+Label the issue with `type:bug` and `maxsim:auto`.
+
+---
+
+## Fresh Executor Context
+
+Each retry attempt MUST use a fresh executor agent:
+
+- Do NOT reuse the previous executor (spawn a new one)
+- Provide the full task spec (do not assume prior context carries over)
+- Include the diagnostic summary from the failed run
+- Include revised instructions based on root cause analysis
+
+Treat each fresh executor as a cold start. Do NOT reference or build upon any previous attempt's reasoning or partial work.
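
As a rough illustration of the noisy-metric rules added to `loop-protocol.md` in the diff above, the sketch below shows one way the keep decision could be computed. It is not taken from the MaxsimCLI code. The names `runs_for`, `improvement_pct`, `keep_decision`, and the `verify` callable are hypothetical. It assumes that lower metric values are better, that improvements below 1% (which the text does not cover) also get 3 runs, and that the confirmation run is judged against the same 0.5% noise floor.

```python
from statistics import median
from typing import Callable

# "typically 0.5% for benchmarks" per the protocol text; tune per metric.
NOISE_FLOOR_PCT = 0.5


def runs_for(improvement: float) -> int:
    """Verify runs to perform: 3 for improvements of 1 to 5%, 5 for >5%.

    Assumption: improvements under 1% are not covered by the text,
    so this sketch also uses 3 runs for them.
    """
    return 5 if improvement > 5.0 else 3


def improvement_pct(baseline: float, candidate: float) -> float:
    """Percent improvement over baseline, assuming lower is better."""
    return (baseline - candidate) / baseline * 100.0


def keep_decision(baseline: float, first_result: float,
                  verify: Callable[[], float]) -> bool:
    """Decide whether to keep a change, given a hypothetical verify() callable."""
    claimed = improvement_pct(baseline, first_result)
    if claimed < NOISE_FLOOR_PCT:
        return False  # below the noise floor: ignore without re-running

    # Multi-run verification: take the median of N fresh measurements.
    samples = [verify() for _ in range(runs_for(claimed))]
    if improvement_pct(baseline, median(samples)) < NOISE_FLOOR_PCT:
        return False

    # Confirmation run: re-verify once more before the final keep decision.
    return improvement_pct(baseline, verify()) >= NOISE_FLOOR_PCT
```

A caller would pass a `verify` function that runs the project's verify command and returns the measured metric. Environment pinning (seeds, test ordering, cache flushes) is expected to happen inside that function, outside this sketch.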