
Make Auto Review snapshot-aware and preserve superseded results #76

@cbusillo

Description


Summary

Auto Review often finishes after the work it reviewed has already moved on or merged. The findings can then arrive in the active LLM session as if they apply to the current state, even when they target an older snapshot. We should make Auto Review snapshot-aware, preserve superseded review artifacts, and evaluate model/follow-up settings before changing defaults.

Current Evidence

Local session catalog sample from /Users/cbusillo/.code/sessions/index/catalog.jsonl (a sketch of the aggregation follows the list):

  • 221 Auto Review sessions found.
  • Median duration: about 5.3 minutes.
  • 75th percentile: about 14.2 minutes.
  • 90th percentile: about 39.5 minutes.
  • Mean duration: about 16.2 minutes.
  • 41 sessions exceeded 20 minutes.
  • 18 sessions exceeded 60 minutes.
  • Longest observed session was about 163.6 minutes.
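For reference, a minimal sketch of how these figures could be reproduced from the catalog, assuming each JSONL record carries a session type and start/end timestamps (the field names `kind`, `started_at`, and `ended_at` are assumptions, not the actual schema):

```python
import json
from datetime import datetime
from pathlib import Path
from statistics import mean, quantiles

CATALOG = Path.home() / ".code/sessions/index/catalog.jsonl"

durations = []  # Auto Review session lengths, in minutes
for line in CATALOG.read_text().splitlines():
    rec = json.loads(line)
    # Field names are assumptions; adjust to the real catalog schema.
    if rec.get("kind") != "auto_review":
        continue
    start = datetime.fromisoformat(rec["started_at"])
    end = datetime.fromisoformat(rec["ended_at"])
    durations.append((end - start).total_seconds() / 60)

durations.sort()
q = quantiles(durations, n=20)  # 19 cut points at 5% steps
print(f"n={len(durations)} median={q[9]:.1f}m p75={q[14]:.1f}m p90={q[17]:.1f}m")
print(f"mean={mean(durations):.1f}m max={durations[-1]:.1f}m")
print(f">20m: {sum(d > 20 for d in durations)}  >60m: {sum(d > 60 for d in durations)}")
```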

Developer-message history also showed 280 background Auto Review completion notifications spread across 9 main sessions, with one session receiving 83 completion messages. This matches the observed behavior where findings can arrive repeatedly after the main work has already continued.

Current local config appears to use:

auto_review_model = "gpt-5.4-mini"
auto_review_model_reasoning_effort = "low"
auto_review_resolve_model = "gpt-5.4"
auto_review_resolve_model_reasoning_effort = "high"
auto_review_use_chat_model = false
auto_review_followup_attempts = 10
review_auto_resolve = true

This suggests the initial review may be relatively cheap, while auto-resolve follow-up loops can become long, expensive sessions.

Problem

Auto Review currently behaves like an asynchronous reviewer but reports into the active session as if its result is current. That creates several issues:

  • Late findings can target a stale commit or branch state.
  • The active LLM may not know whether Auto Review is still running.
  • Superseded review work can still be useful, but it should not be treated as fresh instructions.
  • Up to 10 auto-resolve follow-ups can turn a background review into a long-running workstream.
  • Changing model/follow-up defaults without evidence could make review faster but worse.

Desired Behavior

  • Track Auto Review state by (repo, branch/worktree, target snapshot, diff base); see the state sketch after this list.
  • Expose active Auto Review state to the main agent/session.
  • On completion, compare the review's target snapshot to the current HEAD (or the current active work snapshot).
  • If the result is current, report it normally.
  • If the result is superseded, preserve the worktree/artifact and report it as stale/superseded instead of injecting it as active instructions.
  • Keep superseded findings discoverable for manual adoption.
  • Prevent duplicate reviews for the same snapshot from producing repeated active interruptions.
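A minimal sketch of the identity key and state record this implies; every name here is illustrative, not existing code:

```python
from dataclasses import dataclass
from enum import Enum

class ReviewPhase(Enum):
    IDLE = "idle"
    RUNNING = "running"
    CURRENT = "current_findings"
    SUPERSEDED = "superseded"
    CANCELLED = "cancelled"

@dataclass(frozen=True)
class ReviewKey:
    """Identity for one Auto Review run; hashable, so duplicate reviews
    of the same snapshot can be deduplicated."""
    repo: str             # repo root path or remote URL
    branch: str           # branch or worktree identifier
    target_snapshot: str  # commit SHA the review ran against
    diff_base: str        # base SHA the reviewed diff was computed from

@dataclass
class ReviewState:
    key: ReviewKey
    phase: ReviewPhase
    started_at: float           # epoch seconds, for reporting age
    worktree_path: str | None   # preserved even when superseded
    auto_resolve_attempt: int = 0
```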

Proposed Approach

  1. Add snapshot metadata to Auto Review lifecycle records and completion messages.
  2. Add a completion gate that classifies results as current or superseded before surfacing them to the main session (sketched after this list).
  3. Preserve superseded artifacts and worktrees, but route them to a review inbox/log instead of the active instruction stream.
  4. Add status visible to the active agent: idle/running/current findings/superseded/cancelled, target snapshot, age, worktree path, and auto-resolve phase.
  5. Build a replay/evaluation harness from past Auto Review sessions before changing model defaults or follow-up limits.
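One way the completion gate in step 2 could classify a finished review, assuming the run recorded the SHA it targeted. The git plumbing calls are standard; the surrounding function names are hypothetical:

```python
import subprocess

def _git(repo: str, *args: str) -> str:
    return subprocess.run(
        ["git", "-C", repo, *args],
        check=True, capture_output=True, text=True,
    ).stdout.strip()

def classify_completion(repo: str, target_snapshot: str) -> str:
    """Return "current" if HEAD is still the reviewed snapshot,
    otherwise "superseded"."""
    head = _git(repo, "rev-parse", "HEAD")
    if head == target_snapshot:
        return "current"
    # HEAD moved on. The reviewed snapshot may still be an ancestor
    # (work continued on top) or unreachable (rebase/amend); in both
    # cases the findings should be routed to the review inbox rather
    # than injected as active instructions.
    return "superseded"
```

A later refinement could use `git merge-base --is-ancestor` to distinguish "superseded but still on this branch" from "rewritten", but the binary gate is the smallest useful version.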

Model And Follow-Up Study

Questions to answer before tuning defaults:

  • Does gpt-5.4-mini at low reasoning produce noisy findings that trigger unnecessary auto-resolve loops?
  • Would a stronger reviewer with fewer follow-ups reduce total wall time and interruptions while preserving quality?
  • Is gpt-5.4 high reasoning necessary for every resolve pass?
  • What is the quality/speed tradeoff for one or two follow-ups versus the current limit of 10?

Suggested variants to compare on replayed historical cases:

  • Current config.
  • Same reviewer, one follow-up.
  • Stronger reviewer, one follow-up.
  • Stronger reviewer, no auto-resolve.
  • Current reviewer, resolve at medium reasoning.

Score each variant on finding correctness, duplicate/noisy findings, stale-result handling, wall time, number of turns, and whether the final result materially improved after follow-ups.
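As a starting point for the harness, the variant matrix and per-case scores could be plain data. The model names and follow-up counts mirror the local settings above; everything else (including the reasoning efforts chosen for the stronger-reviewer variants) is illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variant:
    name: str
    review_model: str
    review_effort: str
    resolve_model: str | None   # None disables auto-resolve
    resolve_effort: str | None
    followup_attempts: int

VARIANTS = [
    Variant("current", "gpt-5.4-mini", "low", "gpt-5.4", "high", 10),
    Variant("same-reviewer-one-followup", "gpt-5.4-mini", "low", "gpt-5.4", "high", 1),
    Variant("stronger-reviewer-one-followup", "gpt-5.4", "medium", "gpt-5.4", "high", 1),
    Variant("stronger-reviewer-no-resolve", "gpt-5.4", "medium", None, None, 0),
    Variant("resolve-at-medium", "gpt-5.4-mini", "low", "gpt-5.4", "medium", 10),
]

@dataclass
class CaseScore:
    variant: str
    case_id: str
    correct_findings: int
    noisy_findings: int       # duplicates and non-actionable results
    stale_handled: bool       # superseded result routed correctly
    wall_time_minutes: float
    turns: int
    followups_improved: bool  # did follow-ups materially improve the result?
```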

Acceptance Criteria

  • Auto Review completion messages include target snapshot metadata.
  • Superseded Auto Review results are not injected as active instructions.
  • Superseded findings and worktrees remain discoverable.
  • Active sessions can see whether Auto Review is running and what snapshot it targets.
  • Duplicate/completed reviews for the same snapshot do not repeatedly interrupt the user or agent.
  • Model/follow-up default changes are backed by replay evidence.
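To make the first criterion concrete, a completion message might carry metadata along these lines (field names and values are illustrative):

```python
completion_message = {
    "event": "auto_review_completed",
    "repo": "/path/to/repo",
    "branch": "feature/snapshot-gate",
    "target_snapshot": "abc1234",  # SHA the review actually examined
    "diff_base": "def5678",
    "status": "superseded",        # or "current"
    "worktree_path": "/path/to/preserved/worktree",
    "findings_ref": "review-inbox/2026-05-13-abc1234.json",
}
```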

Current Status

State: Issue created from local investigation; no code changes yet.
Next action: Inventory Auto Review lifecycle code and design the smallest snapshot-aware completion gate.
Blocked by: Need decision on whether to implement stale-result gating before model/follow-up tuning.
Last verified: 2026-05-13, via local logs/config inspection.
