diff --git a/docs/beam-10m-smoke-plan.md b/docs/beam-10m-smoke-plan.md deleted file mode 100644 index 8088d81..0000000 --- a/docs/beam-10m-smoke-plan.md +++ /dev/null @@ -1,150 +0,0 @@ -# BEAM-10M Smoke + Multirun Plan - -**Date:** 2026-05-07 -**Branch:** `worktree-tbc-prototype` -**Status:** plan only. Smoke runs in T3.2 (post-T2.4 wire-up); full multirun in T4.1. - ---- - -## 1. Dataset acquisition - -**Source:** `Mohammadta/BEAM-10M` on HuggingFace. -**Shape:** 10 conversations × 200 total questions (20 per conv × 10 abilities × 2 each). -**Approx context:** ~1.4M tokens per conversation × 10 = ~14M tokens total. -**On-disk size:** ~140 MB JSON. - -Acquisition command: -```bash -huggingface-cli download Mohammadta/BEAM-10M --repo-type dataset \ - --local-dir atomicmemory-benchmarks/data/beam-10m -``` - -Then `src/eval/beam-10m-loader.ts::loadBeam10MDataset()` parses the local JSON. The current loader is a typed stub; the real implementation lands when the file is in place. - ---- - -## 2. Cost estimate - -Per `estimateBeamCost()` (per-seed, full 10-conv × 200-question run): - -| Component | Cost | -|---|---| -| Ingest (10 convs × ~150 facts × $0.002/fact) | ~$3.00 | -| Hierarchical summaries (10 convs × ~50 sessions × $0.001/session + $0.005/conv-summary) | ~$0.55 | -| Search + answer + judge (200 q × $0.10/q multi-iter) | ~$20.00 | -| **Per-seed total** | **~$23.55** | -| **Multirun n=3 with bootstrap CI** | **~$70.65** | - -Numbers are conservative — actual BEAM-100K runs came in ~50% under the per-question estimate. Likely real cost: **$60–80 for full n=3 multirun**. - -LiteLLM remaining budget: $68 of $100 → fits comfortably with no extra spend approval. - ---- - -## 3. Smoke test scope - -| Property | Value | -|---|---| -| Conversations | conv-1 only | -| Questions | 20 (smoke = full conv-1 question set) | -| Seeds | n=1 | -| Stack | H1.1 (token budget=4000, top_k=100) + classifier on + ability_hint OFF + TBC ON + hierarchical ON | -| Backbone | Haiku 4.5 via LiteLLM | -| Cost cap | **$5** (hard cap; abort if exceeded) | -| Wall-time cap | 30 minutes | - -**Why these knobs:** every Phase 1–2 lever set to its current best value; Phase 3 architectural additions (TBC + hierarchical) enabled to validate they don't crash at 10M scale. - -### Smoke success criteria - -| Criterion | Threshold | Action if missed | -|---|---|---| -| No crashes / unhandled errors | 0 | block T4.1, diagnose | -| Wall time | ≤ 30 min | tune top_k or skip TBC for multirun | -| LiteLLM cost | ≤ $5 | scale knobs down before multirun | -| AM ingest completes for 1.4M-token conversation | yes | check for token-limit errors; chunk if needed | -| All 20 questions answered (no parse-errors) | ≥ 18/20 | investigate judge prompt; lower threshold ok | -| Composite ≥ 0.40 on conv-1 | yes | gates the $70+ multirun | - -A composite of **0.40** on conv-1 is a sanity floor (matches conv-1 BEAM-100K performance). If conv-1 BEAM-10M is below 0.40, the architecture isn't reaching the right facts at scale and a multirun would burn budget on a known-bad config. - ---- - -## 4. Multirun protocol (T4.1) - -**Trigger:** smoke test passes all success criteria. 
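For concreteness, a minimal TypeScript sketch of how that smoke gate could be checked before firing the multirun — the `SmokeResult` shape and function name are illustrative only, not an existing module:

```typescript
// Hypothetical gate check for the conv-1 smoke run. The SmokeResult fields
// and the function name are illustrative only — no such module exists yet.
interface SmokeResult {
  unhandledErrors: number;
  wallTimeMinutes: number;
  litellmCostUsd: number;
  ingestCompleted: boolean;   // full 1.4M-token conv-1 ingest finished
  answeredQuestions: number;  // out of 20 (no parse-errors)
  composite: number;          // conv-1 composite
}

/** Empty return means every criterion passed and T4.1 may fire. */
function smokeGateViolations(r: SmokeResult): string[] {
  const violations: string[] = [];
  if (r.unhandledErrors > 0) violations.push('crashes / unhandled errors');
  if (r.wallTimeMinutes > 30) violations.push('wall time > 30 min');
  if (r.litellmCostUsd > 5) violations.push('LiteLLM cost > $5');
  if (!r.ingestCompleted) violations.push('1.4M-token ingest did not complete');
  if (r.answeredQuestions < 18) violations.push('fewer than 18/20 questions answered');
  if (r.composite < 0.40) violations.push('conv-1 composite below the 0.40 sanity floor');
  return violations;
}
```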
- -| Property | Value | -|---|---| -| Conversations | all 10 (full BEAM-10M canonical set) | -| Questions | 200 | -| Seeds | n=3 | -| Bootstrap | 95% CI over per-conv composites | -| Backbone | Haiku 4.5 | -| Hard cost cap | $130 (deny on overrun) | -| Hard wall-time cap | 8 hours | - -### Multirun seeds - -Seeds are different **answer-LLM seeds** + **session-summary-LLM seeds**, both forwarded via the existing seed plumbing. Ingest is deterministic per fixture so seeds only affect generation steps. - -### Result reporting - -| File | Contents | -|---|---| -| `data/beam-10m/results-T4.1-.json` | per-conv per-question per-seed verdicts | -| `data/beam-10m/results-T4.1--summary.md` | composite + per-ability breakdown + bootstrap CI | -| `verified-results-extended.csv` row | `AM TBC+hierarchical, Haiku 4.5, BEAM-10M, , n=3` | - -### Comparison targets - -| System | Reference | Source | -|---|---|---| -| Mem0 OSS BEAM-10M | 0.486 | `mem0ai/memory-benchmarks` published | -| Truncation baseline (this sprint, BEAM-10M) | TBD via T4.0 | needs separate $25 truncation run | -| AM H1.1 + TBC + hierarchical (this run) | TBD | T4.1 output | - ---- - -## 5. Decision gate after T4.1 - -| 3-conv composite mean | Interpretation | Next action | -|---|---|---| -| **≥ 0.55** | Strong SOTA over Mem0 0.486 (+0.06+) | Write **paper variant 5.1-A**: "AM beats Mem0 on BEAM-10M with TBC + hierarchical" | -| **0.49–0.55** | Marginal SOTA (+0.00–0.06) | Write **paper variant 5.1-B**: "AM matches/edges Mem0 on BEAM-10M with reproducible OSS architecture" | -| **0.45–0.49** | Tied within noise | Methodology paper (5.1-C) + BEAM-100K headline; framing notes 10M parity | -| **< 0.45** | Below Mem0 baseline | Pivot fully to methodology paper (5.1-C) using existing BEAM-100K data | - -**Probability estimates** (based on Haiku BEAM-100K +0.10 lift from H1.1 + Tier 2 unmeasured architectural lift): - -| Outcome | Probability | -|---|---| -| ≥ 0.55 (clean SOTA) | ~25% | -| 0.49–0.55 (marginal SOTA) | ~30% | -| 0.45–0.49 (tied) | ~25% | -| < 0.45 (below baseline) | ~20% | - -Combined "we hit SOTA-or-tied" probability: **~80%.** Fallback paper exists either way. - ---- - -## 6. Pre-T4.1 dependencies - -Before firing the multirun, these must land: - -| Dependency | Status | Blocker for T4.1? | -|---|---|---| -| T2.4: hierarchical arm wired into memory-search.ts RRF fusion | not yet defined | yes (without this, hierarchical does nothing) | -| TBC dual-write hook installed in production runtime | not yet defined | only if TBC is in the run | -| BEAM-10M dataset downloaded locally | T3.2 follow-up | yes | -| AM server restarted with `HIERARCHICAL_RETRIEVAL_ENABLED=true` and `TBC_ENABLED=true` | env config | yes | -| BEAM-10M smoke (conv-1) passes all criteria | T3.2 | yes | - ---- - -## 7. What we are NOT doing in this sprint - -- BEAM-1M tier (managed-platform-only Mem0 number, unreproducible) -- Cross-backbone GPT-5 BEAM-10M (cost-prohibitive given the existing $32 spent + remaining headroom) -- Multi-bench validation (LongMemEval, MultiSessionChat) — defer to follow-up sprint -- Custom retrievers / trained classifiers (research-grade, multi-month) diff --git a/docs/hierarchical-retrieval.md b/docs/hierarchical-retrieval.md deleted file mode 100644 index ea8479f..0000000 --- a/docs/hierarchical-retrieval.md +++ /dev/null @@ -1,239 +0,0 @@ -# Hierarchical Retrieval — Design (T2.1) - -**Date:** 2026-05-06 -**Branch:** `worktree-tbc-prototype` -**Status:** design only. 
Implementation in T2.2 (session-summary generation) and T2.3 (5th RRF arm). -**Target:** BEAM-10M tier (10 conversations × ~1.4M tokens each = ~14M total context per system). - ---- - -## Why hierarchical retrieval is needed for BEAM-10M - -**BEAM-100K** (the tier we've been running) has 3 conversations × ~33k tokens each = ~100k total context. Top-K=100 over a flat vector index can plausibly recall the right facts. - -**BEAM-10M** has 10 conversations × ~1.4M tokens each ≈ **14M total context**. The fact store grows to ~3,000–5,000 atomic claims per system. A flat vector retrieval over 5,000 facts at top-K=100 returns 2% of the store; if the right facts are not in that 2%, the answer LLM has nothing useful. - -Hindsight publishes 0.486 BEAM-10M with their TEMPR architecture (4 retrieval arms + cross-encoder rerank + token budget). Their advantage at this tier is precisely the **multi-arm retrieval** that separately surfaces (a) topically similar facts, (b) lexically matching facts, (c) entity-graph-connected facts, (d) temporally-relevant facts. - -We have arms (a) and (b) (vector + BM25). We're missing the **hierarchical** retrieval shape that handles 14M-token context: **first pick the right conversation/session, then expand to atomic facts within**. Without this, the 4th and 5th RRF arms (temporal, graph) are also recall-bound by the same flat-store problem. - ---- - -## The architecture - -### Three-level memory hierarchy - -``` - conversation (level 2) - ├── conv_summary (~200 tokens, embedded) - │ - ├── session (level 1) - │ ├── session_summary (~100 tokens, embedded) - │ ├── session_topics (existing FactMetadata field) - │ │ - │ └── atomic claim (level 0 — existing memories table) - │ ├── content - │ ├── embedding - │ ├── classifier metadata (existing) - │ └── belief state (TBC Phase 3 — confidence, tier, edges) -``` - -Level 0 already exists: the `memories` table with per-claim atomic storage. -Level 1 (session) and Level 2 (conversation) are new. 
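As an application-level view of the two new levels (interface names are assumptions, not existing code; the SQL below is the authoritative shape):

```typescript
// Assumed application-level row shapes for the two new summary levels.
// They mirror the session_summaries / conv_summaries tables defined below.
interface SessionSummaryRow {
  id: string;
  userId: string;
  conversationId: string;
  sessionIndex: number;
  summaryText: string;          // ~100 tokens, LLM-generated
  summaryEmbedding: number[];
  topics: string[];
  factCount: number;
  occurredStart: Date | null;
  occurredEnd: Date | null;
}

interface ConvSummaryRow {
  id: string;
  userId: string;
  conversationId: string;
  summaryText: string;          // ~200 tokens, LLM-generated
  summaryEmbedding: number[];
  sessionCount: number;
  factCount: number;
  occurredStart: Date | null;
  occurredEnd: Date | null;
}
```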
- -### Schema additions - -**New tables** (Phase 5 of TBC roadmap): - -```sql -CREATE TABLE IF NOT EXISTS session_summaries ( - id UUID PRIMARY KEY DEFAULT gen_random_uuid(), - user_id TEXT NOT NULL, - session_id TEXT NOT NULL, -- BEAM-style session anchor - conversation_id TEXT NOT NULL, - session_index INTEGER NOT NULL, - summary_text TEXT NOT NULL, -- LLM-generated, ~100 tokens - summary_embedding vector({{EMBEDDING_DIMENSIONS}}) NOT NULL, - topics TEXT[] NOT NULL DEFAULT '{}', -- denormalized from session_topics metadata - fact_count INTEGER NOT NULL DEFAULT 0, - occurred_start TIMESTAMPTZ DEFAULT NULL, - occurred_end TIMESTAMPTZ DEFAULT NULL, - created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), - workspace_id UUID DEFAULT NULL, - agent_id UUID DEFAULT NULL -); - -CREATE TABLE IF NOT EXISTS conv_summaries ( - id UUID PRIMARY KEY DEFAULT gen_random_uuid(), - user_id TEXT NOT NULL, - conversation_id TEXT NOT NULL, - summary_text TEXT NOT NULL, -- LLM-generated, ~200 tokens - summary_embedding vector({{EMBEDDING_DIMENSIONS}}) NOT NULL, - session_count INTEGER NOT NULL DEFAULT 0, - fact_count INTEGER NOT NULL DEFAULT 0, - occurred_start TIMESTAMPTZ DEFAULT NULL, - occurred_end TIMESTAMPTZ DEFAULT NULL, - created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), - workspace_id UUID DEFAULT NULL, - agent_id UUID DEFAULT NULL -); - -CREATE INDEX IF NOT EXISTS idx_session_summaries_user_conv - ON session_summaries (user_id, conversation_id, session_index); -CREATE INDEX IF NOT EXISTS idx_session_summaries_embedding - ON session_summaries USING hnsw (summary_embedding vector_cosine_ops) - WITH (m = 16, ef_construction = 200); -CREATE INDEX IF NOT EXISTS idx_conv_summaries_user - ON conv_summaries (user_id, conversation_id); -CREATE INDEX IF NOT EXISTS idx_conv_summaries_embedding - ON conv_summaries USING hnsw (summary_embedding vector_cosine_ops) - WITH (m = 16, ef_construction = 200); -``` - ---- - -## When summaries are generated - -| Granularity | Trigger | Latency budget | Cost per summary | -|---|---|---|---| -| Session summary | end-of-session ingest (last batch chunk lands) | 1-3 s | ~$0.001 (Haiku, ~500 input tokens) | -| Conversation summary | end-of-conversation ingest (final session ingested) | 2-5 s | ~$0.005 (Haiku, ~2000 input tokens) | - -The LLM call is gated by config flag `HIERARCHICAL_RETRIEVAL_ENABLED`. When off, no summaries generated, no rows written. - -For BEAM-10M: -- 10 conversations × ~50 sessions each = 500 session summaries × $0.001 = $0.50 -- 10 conversation summaries × $0.005 = $0.05 -- **Total summary-generation cost: ~$0.55 per system per BEAM-10M run.** Negligible. 
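A minimal sketch of the end-of-session hook as it might land in T2.2 — the helper names and deps wiring here are assumptions, not existing code:

```typescript
// Sketch of the T2.2 end-of-session hook (names are assumptions, not existing code).
interface IngestedFact { id: string; content: string; observedAt: Date | null; }

interface SessionSummaryDeps {
  hierarchicalRetrievalEnabled: boolean;                                  // HIERARCHICAL_RETRIEVAL_ENABLED
  llmSummarize: (facts: readonly IngestedFact[]) => Promise<string>;      // ~100-token Haiku call
  embed: (text: string) => Promise<number[]>;
  insertSessionSummary: (row: Record<string, unknown>) => Promise<void>;  // writes session_summaries
}

async function onSessionIngestComplete(
  deps: SessionSummaryDeps,
  userId: string,
  conversationId: string,
  sessionId: string,
  sessionIndex: number,
  facts: readonly IngestedFact[],
): Promise<void> {
  if (!deps.hierarchicalRetrievalEnabled || facts.length === 0) return; // flag off → no row written
  const summaryText = await deps.llmSummarize(facts);
  const summaryEmbedding = await deps.embed(summaryText);
  const observed = facts.flatMap(f => (f.observedAt ? [f.observedAt.getTime()] : []));
  await deps.insertSessionSummary({
    userId, conversationId, sessionId, sessionIndex,
    summaryText, summaryEmbedding, factCount: facts.length,
    occurredStart: observed.length ? new Date(Math.min(...observed)) : null,
    occurredEnd: observed.length ? new Date(Math.max(...observed)) : null,
  });
}
```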
- ---- - -## How retrieval changes — the 5th RRF arm - -### Existing 3-arm pipeline (current AM) -``` -query - ├── vector arm → top-K via pgvector cosine on embeddings - ├── BM25 arm → top-K via tsvector - └── (existing extras: lessons gate, consensus filter) - ↓ - RRF fusion - ↓ - top-K facts -``` - -### New 5-arm pipeline (T2.3) -``` -query - ├── vector arm (existing) - ├── BM25 arm (existing) - ├── temporal arm (H1.2 — pending; date-range parsed from query) - ├── graph arm (H1.3 — pending; entity-link 1-hop expansion) - └── HIERARCHICAL arm (this doc) - ├── stage 1: vector-search conv_summaries → top-3 conversations - ├── stage 2: vector-search session_summaries WHERE conv ∈ stage-1 → top-10 sessions - ├── stage 3: vector-search memories WHERE session ∈ stage-2 → top-50 facts - └── candidates feed into RRF as a 5th arm - ↓ - RRF fusion (k=60) - ↓ - top-300 candidates → cross-encoder rerank - ↓ - fill prompt to SEARCH_TOKEN_BUDGET tokens -``` - -### Why the hierarchical arm complements vector - -Vector arm answers "which atomic facts are most similar to the query?" — fast at small scale, drowns in a 5000-fact store. - -Hierarchical arm answers "which session's *gist* matches the query, and what facts live in that session?" — gives the answerer a coherent slice of conversation rather than a scattered sample. - -For "What did we agree on at the project kickoff?" — vector might return 100 unrelated atoms; hierarchical filters to the kickoff-session summary first, then expands within. - -For "What did we discuss about the API design over the last month?" — hierarchical surfaces 3-5 sessions whose summaries mention "API"; vector arm picks the specific atomic facts inside those sessions; both go through RRF. - ---- - -## Per-ability hypotheses - -Hierarchical arm is hypothesised to lift these BEAM abilities: - -| Ability | Why hierarchical helps | -|---|---| -| **MSR** (multi-session reasoning) | The arm explicitly surfaces 3+ distinct sessions before atomic expansion — exactly what MSR questions need. Currently broken at 0/6 across our runs. | -| **EO** (event ordering) | Session summaries carry `occurred_start`/`occurred_end` time anchors; ordering is geometric. Currently 0/6. | -| **TR** (temporal reasoning) | "Last month" → conv_summary filter by occurred_at range → expand. Cheaper than the standalone temporal RRF arm. | -| **SUM** (summarization) | Conv summaries ARE the summarisation — the arm directly returns them when SUM queries hit. | - -Conservatively expected lift: **+0.10 on BEAM-10M composite** from MSR/EO alone, more if SUM benefits. - ---- - -## Why this is a 5th arm, not a replacement - -A fully hierarchical retriever (LLM walks the document tree, paradigm 5 in the 19-system survey) replaces vector entirely. We **do not** want that — it's slow and expensive. Hindsight, Mem0, and other paradigm-4 systems found that hierarchical retrieval works best as a **fused arm** alongside vector + BM25 + temporal, not as the dominant strategy. - -Our hierarchical arm: -- Returns ~50 atomic-level candidates (same scale as the other arms) -- Goes through the existing RRF fusion (k=60) -- Per-arm weights stay equal (per Hindsight's empirical finding) -- Cross-encoder reranks the union - -The arm earns its slot in the union by surfacing facts that vector+BM25 miss because they're more similar to the *session gist* than to the literal query. 
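A minimal sketch of the three-stage arm as it could be wired in T2.3 — table names follow the schema above, but the function itself, the `memories.session_id` column, and the pool wiring are assumptions, not existing code:

```typescript
// Sketch of the T2.3 hierarchical arm. Candidates feed into RRF as the 5th arm.
import pg from 'pg';

interface ArmCandidate { memoryId: string; rank: number; }

async function hierarchicalArm(
  pool: pg.Pool,
  userId: string,
  queryEmbedding: number[],
): Promise<ArmCandidate[]> {
  const vec = `[${queryEmbedding.join(',')}]`;

  // Stage 1: top-3 conversations by conv-summary cosine similarity.
  const convs = await pool.query(
    `SELECT conversation_id FROM conv_summaries
      WHERE user_id = $1
      ORDER BY summary_embedding <=> $2::vector LIMIT 3`,
    [userId, vec],
  );
  const convIds = convs.rows.map(r => r.conversation_id);
  if (convIds.length === 0) return [];

  // Stage 2: top-10 sessions inside those conversations.
  const sessions = await pool.query(
    `SELECT session_id FROM session_summaries
      WHERE user_id = $1 AND conversation_id = ANY($2)
      ORDER BY summary_embedding <=> $3::vector LIMIT 10`,
    [userId, convIds, vec],
  );
  const sessionIds = sessions.rows.map(r => r.session_id);
  if (sessionIds.length === 0) return [];

  // Stage 3: top-50 atomic facts inside those sessions. When TBC Phase 3 lands,
  // a belief-tier filter (e.g. AND belief_tier <> 'retracted') is added here.
  const facts = await pool.query(
    `SELECT id FROM memories
      WHERE user_id = $1 AND session_id = ANY($2)
      ORDER BY embedding <=> $3::vector LIMIT 50`,
    [userId, sessionIds, vec],
  );
  return facts.rows.map((r, i) => ({ memoryId: r.id, rank: i + 1 }));
}
```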
- ---- - -## Cost model for BEAM-10M - -| Phase | Cost | -|---|---| -| Ingest (10 convs × ~50 sessions × ~150 facts) | ~$5 LLM (existing pipeline) | -| Session-summary generation | ~$0.55 | -| Conv-summary generation | ~$0.05 | -| Question phase (200 questions × multi-iter answer + judge) | ~$25-40 | -| **Total per seed** | **~$30-45** | -| Multirun n=3 with bootstrap CI | **~$90-135** | - -Fits within remaining LiteLLM budget ($68 of $100). - ---- - -## Implementation order (after Phase 3 schema lands) - -| Step | Task | Output | -|---|---|---| -| 1 | T2.1 design (this doc) | ✓ | -| 2 | T2.2: session-summary generation in ingest pipeline | new module `session-summary-generator.ts`; gated by env flag | -| 3 | T2.2: conv-summary generation hook | added to ingest pipeline finalizer | -| 4 | Schema migration: append session_summaries + conv_summaries to schema.sql | (additive, IF NOT EXISTS) | -| 5 | T2.3: hierarchical retrieval arm in memory-search | wired into RRF fusion; gated by `HIERARCHICAL_RETRIEVAL_ENABLED` | -| 6 | Smoke test on BEAM-100K (single conv) | confirm no regression when flag off; lift on MSR/EO when on | -| 7 | T3.1: BEAM-10M smoke (single conv) | end-to-end at scale | - -Steps 2-5 are ~2-3 weeks of focused engineering. Step 6 is the regression gate before BEAM-10M cost. - ---- - -## Open questions for implementation - -1. **Summary prompt template.** Should session summaries be facts-style ("Alice mentioned she's switching to TypeScript; team agreed on Postgres") or topics-style ("TypeScript migration discussion; database choice debate")? Topics-style aligns with `session_topics` metadata; facts-style is more retrievable. Recommend topics-style for v1. - -2. **Re-summarisation under update.** When a session has new facts ingested after summary generation (e.g., late-arriving messages), do we regenerate the summary? Default: yes, replace; the table's `created_at` reflects the latest gen. - -3. **Cross-conversation summary.** A user's "career trajectory" might span multiple conversations. Should there be a higher-level "user_summary" tier? Defer to Phase 6+; not needed for BEAM-10M. - -4. **Hot-path latency.** Hierarchical arm adds 2 vector lookups + 1 SQL filter per query. With HNSW indexes both lookups are <50ms. Total query latency: vector(50ms) + BM25(20ms) + hierarchical(120ms) + temporal(30ms) + graph(150ms) + RRF(5ms) + rerank(80ms) = ~450ms. Acceptable for non-realtime BEAM eval. - -5. **Belief-tier integration.** When TBC Phase 3 lands, hierarchical retrieval should respect `belief_tier`. Concretely: stage-3 atomic-fact retrieval filters out `tier='retracted'`; directives surface first. This is a one-line WHERE clause addition in T2.3. - ---- - -## Why this is the right next step (not Phase 3 of TBC) - -Two architectural commitments are in flight: -- **TBC Phase 3** (T1.2, T1.3) — typed belief operators with a queryable graph. -- **Hierarchical retrieval** (T2) — multi-level summary indexing for BEAM-10M scale. - -These are **orthogonal** — TBC Phase 3 changes WHAT we store; hierarchical changes HOW we retrieve. They compose cleanly: hierarchical retrieval at stage 3 (atomic) reads the new TBC `belief_tier` column to skip retracted claims and surface directives. - -The right ordering: TBC Phase 3 schema first (T1.2 done), then hierarchical retrieval implementation (T2.2 + T2.3), then dual-write integration (T1.3) so the search layer can read both new shapes simultaneously. Both ship together as the "BEAM-10M architecture commit." 
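For reference, the RRF fusion referenced throughout (k=60, equal per-arm weights) in sketch form — the production fusion in memory-search already implements this for the existing arms; this is only an illustration of the scoring:

```typescript
// Reciprocal Rank Fusion: score(d) = Σ over arms of 1 / (k + rank(d)), equal weights.
function rrfFuse(arms: ReadonlyArray<ReadonlyArray<string>>, k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of arms) {
    ranking.forEach((id, idx) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + idx + 1)); // ranks are 1-based
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```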
diff --git a/docs/superpowers/plans/2026-05-11-beam-085-phase0-phase1.md b/docs/superpowers/plans/2026-05-11-beam-085-phase0-phase1.md deleted file mode 100644 index 50d5cd6..0000000 --- a/docs/superpowers/plans/2026-05-11-beam-085-phase0-phase1.md +++ /dev/null @@ -1,2196 +0,0 @@ -# BEAM 0.85+ — Phase 0 + Phase 1 Implementation Plan - -> **For agentic workers:** REQUIRED SUB-SKILL: Use `superpowers:subagent-driven-development` (recommended) or `superpowers:executing-plans` to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. - -**Goal:** Implement Phase 0 (L1-patched, L3-deleted, baseline relock) and Phase 1 (async Reflect step via Sonnet with `session_reflections` table + `## OBSERVATIONS` retrieval channel) per `docs/superpowers/specs/2026-05-11-beam-085-anthropic-only-design.md`. - -**Architecture:** Phase 0 is surgical cleanup of the broken L3 mechanism and tightening of the L1 over-constraint that hurt event-ordering. Phase 1 adds an async Sonnet-driven consolidation step that runs at session boundaries via a Postgres-backed job queue, writes synthesized observations to a new table, and surfaces them at retrieval time as a separate prompt channel routed by question type. - -**Tech Stack:** TypeScript ESM, Express, Postgres + pgvector, Anthropic SDK (Haiku 4.5 + Sonnet 4.6), Vitest, Docker compose. - -**Gates:** Phase 0 must PASS (composite Δ ≥ +0.05 vs prev baseline 0.411 AND no per-ability regression > 0.10 at 4-conv n=80) before Phase 1 implementation begins. Phase 1 must PASS the same gate before we re-enter brainstorming for Phase 2. - -**Working directory:** `/Users/moralespanitz/me/supernet/atomicmemory-core/.claude/worktrees/tbc-prototype` - -**Constraints (from CLAUDE.md):** -- TypeScript ESM, no `any` -- Files ≤ 400 lines (excluding comments) -- Functions ≤ 40 lines (excluding catch/finally) -- No `process.env` reads outside `src/config.ts` -- Mutations fail closed (no silent fallback) -- JSDoc at top of every file -- Pre-commit: `npx tsc --noEmit`, `npm test`, `fallow --no-cache` - ---- - -## File Structure - -### Phase 0 — modify only - -| File | Change | -|---|---| -| `src/services/answer-format.ts` | Patch ORDERED_LIST hint (relax "EXACTLY"). 
Tighten classifier to require both `list` AND a numeric/spelled-out token | -| `src/services/__tests__/answer-format.test.ts` | New cases for patched behavior | -| `src/services/counter-edge-surface.ts` | **DELETE** | -| `src/services/__tests__/counter-edge-surface.test.ts` | **DELETE** | -| `src/config.ts` | Remove `counterEdgeSurfaceEnabled` flag | -| `src/app/runtime-container.ts` | Remove `counterEdgeSurfaceEnabled` from interface and unused belief-edges retrieval wiring | -| `src/services/search-pipeline.ts` | Remove call to `maybeSurfaceCounterEdges` | -| `src/services/retrieval-format.ts` | Remove `[CONTRADICTS prior fact]` marker emission | -| `src/db/repository-types.ts` | Remove `counterOf?: string` from `SearchResult` | - -### Phase 1 — create - -| File | Role | -|---|---| -| `src/db/migrations/20260512_session_reflections.sql` | Schema for `session_reflections` + `reflection_jobs` | -| `src/db/reflections-repository.ts` | CRUD + cosine-similarity search on `session_reflections` | -| `src/db/reflection-jobs-repository.ts` | Postgres-backed job queue ops | -| `src/services/reflect-prompts.ts` | Sonnet system prompt + Anthropic tool-use schema for consolidation | -| `src/services/reflect.ts` | Orchestrator: load session memories → call Sonnet → parse → persist | -| `src/services/reflect-jobs.ts` | Worker: poll queue, run reflect, mark status | -| `src/services/reflect-retrieval.ts` | Query-time top-K reflection fetch by cosine similarity | -| `src/services/__tests__/reflect.test.ts` | Unit tests for orchestrator | -| `src/services/__tests__/reflect-jobs.test.ts` | Worker unit tests | -| `src/services/__tests__/reflect-retrieval.test.ts` | Retrieval-side tests | -| `src/db/__tests__/reflections-repository.test.ts` | Repo integration tests (real Postgres) | -| `src/db/__tests__/reflection-jobs-repository.test.ts` | Queue integration tests | - -### Phase 1 — modify - -| File | Change | -|---|---| -| `src/services/memory-ingest.ts` | After AUDN commit, enqueue `reflection_jobs` row for `(userId, conversationId)` | -| `src/services/search-pipeline.ts` | When query classifier matches SUM/KU/MSR/CR/PF/IE, fetch top-5 reflections and pass through stores to retrieval-format | -| `src/services/retrieval-format.ts` | Emit `## OBSERVATIONS` prompt channel when reflections are present | -| `src/db/stores.ts` | Add `reflections: ReflectionsRepository` and `reflectionJobs: ReflectionJobsRepository` | -| `src/app/runtime-container.ts` | Instantiate the new repos and start the reflect-jobs worker | -| `src/config.ts` | Add `REFLECT_ENABLED`, `REFLECT_MODEL`, `REFLECT_MAX_OBSERVATIONS`, `REFLECT_JOB_POLL_MS`, `REFLECT_DEBOUNCE_MS`, `REFLECT_RETRIEVAL_TOP_K` | -| `src/routes/reflect.ts` | NEW route file mounted at `/v1/reflect/flush` for synchronous benchmark-mode flush | - -### Validation env files (Phase 0 + Phase 1) - -| File | Role | -|---|---| -| `.env.phase0-l1patched` | Kept stack + L1-patched (`ANSWER_FORMAT_ALIGNMENT_ENABLED=true`) + L3-deleted. Port 3102/5502. | -| `.env.phase1-reflect` | Phase 0 + `REFLECT_ENABLED=true`, `REFLECT_MODEL=claude-sonnet-4-5`. Port 3103/5503. | - ---- - -# PHASE 0 — Foundation cleanup - -Goal: relock baseline above 0.411 (kept-stack reproduction) by patching L1's toxic ORDERED_LIST rule and deleting L3's broken `[CONTRADICTS]` marker mechanism. 
- -## Task 0.1: Patch L1 ORDERED_LIST hint to allow partial answers - -**Files:** -- Modify: `src/services/answer-format.ts` (line 67, FORMAT_HINTS map) -- Test: `src/services/__tests__/answer-format.test.ts` - -- [ ] **Step 1: Write the failing test for the new hint wording** - -Edit `src/services/__tests__/answer-format.test.ts`. Add this test inside the existing `describe('classifyQuestion', () => {})` block at the bottom: - -```typescript -describe('getOutputFormatHint (patched)', () => { - it('ORDERED_LIST hint allows partial answers when items < requested count', () => { - const hint = getOutputFormatHint(QuestionType.ORDERED_LIST); - expect(hint).toContain('if retrievable'); - expect(hint.toLowerCase()).not.toMatch(/exactly the count requested/); - }); -}); -``` - -- [ ] **Step 2: Run the test to verify it fails** - -```bash -cd /Users/moralespanitz/me/supernet/atomicmemory-core/.claude/worktrees/tbc-prototype -npx vitest run src/services/__tests__/answer-format.test.ts -t "ORDERED_LIST hint allows partial" -``` - -Expected: FAIL — current hint contains "EXACTLY the count requested". - -- [ ] **Step 3: Patch the hint in `answer-format.ts`** - -Replace the ORDERED_LIST entry in the `FORMAT_HINTS` map (currently at line ~67): - -```typescript - [QuestionType.ORDERED_LIST]: - "FORMAT: Numbered list. Include EXACTLY the count requested if retrievable from the facts; otherwise list only the items that ARE retrievable and state that fewer than N are available. Format: '1) {item}, 2) {item}, ...'", -``` - -- [ ] **Step 4: Run the test to verify it passes** - -```bash -npx vitest run src/services/__tests__/answer-format.test.ts -t "ORDERED_LIST hint allows partial" -``` - -Expected: PASS. - -- [ ] **Step 5: Run the full answer-format suite to verify no regressions** - -```bash -npx vitest run src/services/__tests__/answer-format.test.ts -``` - -Expected: 11/11 (10 existing + 1 new) PASS. - -- [ ] **Step 6: Commit** - -```bash -git add src/services/answer-format.ts src/services/__tests__/answer-format.test.ts -git commit -m "fix(answer-format): relax ORDERED_LIST hint to allow partial answers - -Sprint 3 L1 diagnostic showed the strict 'EXACTLY the count requested' -phrasing forced Haiku to fabricate items when retrieved facts fell short -of the count. This hurt event_ordering by -0.175 on conv 2 n=20. -Updated hint instructs the model to enumerate only retrievable items -and explicitly state the shortage when count cannot be reached." -``` - -## Task 0.2: Tighten L1 classifier to require numeric token for ORDERED_LIST - -**Files:** -- Modify: `src/services/answer-format.ts` (line 38, ORDERED_LIST_PATTERN) -- Test: `src/services/__tests__/answer-format.test.ts` - -- [ ] **Step 1: Add failing test cases** - -In `src/services/__tests__/answer-format.test.ts`, add inside the `describe('classifyQuestion', () => {})` block: - -```typescript - it('does NOT classify "list common errors" as ORDERED_LIST (no numeric token)', () => { - expect(classifyQuestion('What are some common responses when an API fails? List them.')).toBe( - QuestionType.OTHER, - ); - }); - - it('classifies "list five items in order" as ORDERED_LIST (numeric token present)', () => { - expect(classifyQuestion('Can you list five items in order?')).toBe(QuestionType.ORDERED_LIST); - }); - - it('classifies "Mention ONLY three items" as ORDERED_LIST (numeric token present)', () => { - expect(classifyQuestion('List them in order. 
Mention ONLY three items.')).toBe( - QuestionType.ORDERED_LIST, - ); - }); -``` - -- [ ] **Step 2: Run to verify the new tests fail** - -```bash -npx vitest run src/services/__tests__/answer-format.test.ts -t "ORDERED_LIST" -``` - -Expected: 2 fail (the "common errors" case currently MATCHES ORDERED_LIST because of the loose pattern, and "Mention ONLY three" doesn't match because no "in order"/sequence/chronological token). - -- [ ] **Step 3: Tighten the regex in `answer-format.ts`** - -Replace the ORDERED_LIST_PATTERN constant (line ~38) with: - -```typescript -// Requires either: -// (a) "list ... in order" / "order in which" / "chronological" — explicit ordering verb, OR -// (b) ordering verb + a spelled-out or digit count token ("three", "5", "ONLY five items") -// This prevents false-positives on generic "list X" / "list common errors" queries. -const ORDERED_LIST_NUMERIC = /\b(\d+|one|two|three|four|five|six|seven|eight|nine|ten)\b/i; -const ORDERED_LIST_HINT = /\b(list|sequence|order|chronological|mention)\b/i; -const ORDERED_LIST_EXPLICIT = /\b(list\s+(?:.*?\s+)?in order|order in which|chronological order)\b/i; -``` - -Then replace the corresponding classifier branch (line ~54) with: - -```typescript - if (ORDERED_LIST_EXPLICIT.test(query)) return QuestionType.ORDERED_LIST; - if (ORDERED_LIST_HINT.test(query) && ORDERED_LIST_NUMERIC.test(query)) { - return QuestionType.ORDERED_LIST; - } -``` - -(Place these BEFORE the existing CONTRADICTION/SUMMARY branches — priority must be preserved.) - -- [ ] **Step 4: Run the new tests to verify they pass** - -```bash -npx vitest run src/services/__tests__/answer-format.test.ts -t "ORDERED_LIST" -``` - -Expected: all PASS, including the existing 1 from prior tasks. - -- [ ] **Step 5: Run the full module suite** - -```bash -npx vitest run src/services/__tests__/answer-format.test.ts -``` - -Expected: 14/14 PASS. - -- [ ] **Step 6: tsc clean** - -```bash -npx tsc --noEmit -``` - -Expected: no errors. - -- [ ] **Step 7: Commit** - -```bash -git add src/services/answer-format.ts src/services/__tests__/answer-format.test.ts -git commit -m "fix(answer-format): require numeric token for ORDERED_LIST classification - -Sprint 3 conv 2 diagnostic: 'What are some common responses' falsely -matched the loose ORDERED_LIST regex via the bare 'list' verb, then -triggered the count-enforcement hint and damaged instruction_following -(-0.50 on that question alone). Tightened classifier to require either -an explicit 'in order' phrase or a numeric/spelled-out count token -alongside the list verb. Adds two negative-case tests." 
-``` - -## Task 0.3: Delete L3 (counter-edge-surface) module and its config flag - -**Files:** -- Delete: `src/services/counter-edge-surface.ts` -- Delete: `src/services/__tests__/counter-edge-surface.test.ts` -- Modify: `src/config.ts` (remove `counterEdgeSurfaceEnabled`) -- Modify: `src/app/runtime-container.ts` (remove from interface) -- Modify: `src/services/search-pipeline.ts` (remove `maybeSurfaceCounterEdges` call) -- Modify: `src/services/retrieval-format.ts` (remove `[CONTRADICTS prior fact]` marker) -- Modify: `src/db/repository-types.ts` (remove `counterOf` field from `SearchResult`) - -- [ ] **Step 1: Verify nothing other than these files imports counter-edge-surface** - -```bash -grep -rn "counter-edge-surface\|CounterEdgeSurface\|counterEdgeSurfaceEnabled\|counterOf" src/ \ - --include="*.ts" | grep -v __tests__ | grep -v counter-edge-surface.ts -``` - -Expected output (only these references): -- `src/config.ts` — flag declaration -- `src/app/runtime-container.ts` — flag in interface -- `src/services/search-pipeline.ts` — function call -- `src/services/retrieval-format.ts` — marker emission + `counterOf` checks -- `src/db/repository-types.ts` — field definition - -If you see more, STOP and consult the user. - -- [ ] **Step 2: Delete the module and its tests** - -```bash -rm src/services/counter-edge-surface.ts -rm src/services/__tests__/counter-edge-surface.test.ts -``` - -- [ ] **Step 3: Remove the config flag** - -In `src/config.ts`, find and delete the line: - -```typescript - counterEdgeSurfaceEnabled: (optionalEnv('COUNTER_EDGE_SURFACE_ENABLED') ?? 'false') === 'true', -``` - -Also remove the corresponding entry from the `RuntimeConfig` interface and the `INTERNAL_POLICY_CONFIG_FIELDS` array. - -- [ ] **Step 4: Remove from runtime-container interface** - -In `src/app/runtime-container.ts`, find and delete the line: - -```typescript - counterEdgeSurfaceEnabled: boolean; -``` - -from the `CoreRuntimeConfig` interface. - -- [ ] **Step 5: Remove the call site from search-pipeline.ts** - -In `src/services/search-pipeline.ts`, find and delete the import and the `maybeSurfaceCounterEdges` block (it lives between `applyExpansionAndReranking` and the namespace filter). Replace the `surfaced = ...` line with a direct assignment from the prior `selected` variable. - -- [ ] **Step 6: Remove the `[CONTRADICTS]` marker from retrieval-format.ts** - -In `src/services/retrieval-format.ts`, find every reference to `counterOf` (typically in `formatFullLine`, `formatStagedLine`, `formatSubjectSection`, `formatTieredLine`). Remove the conditional that prepends `[CONTRADICTS prior fact ]:` and emit the line normally. - -- [ ] **Step 7: Remove `counterOf` from `repository-types.ts`** - -In `src/db/repository-types.ts`, find and delete the `counterOf?: string;` field from `SearchResult`. - -- [ ] **Step 8: tsc clean** - -```bash -npx tsc --noEmit -``` - -Expected: no errors. If you see "Cannot find name 'counterOf'" or similar, you missed a reference — grep again. - -- [ ] **Step 9: Run the full test suite to verify nothing else broke** - -```bash -npm test -- --reporter=basic 2>&1 | tail -20 -``` - -Expected: all suites pass except any tests directly testing the deleted module (those were deleted in step 2). - -- [ ] **Step 10: Commit** - -```bash -git add -A -git commit -m "chore: delete Layer 3 (counter-edge-surface) — replaced by CR specialist later - -Sprint 3 conv 2 n=20 measurement: L3-only composite 0.377 vs baseline 0.520 -(-0.143). 
The mechanism designed to lift CR actually dropped CR to 0.125 -(-0.188) because the [CONTRADICTS prior fact] marker confused Haiku into -picking the unmarked side of the contradiction. Per the Phase 0 cleanup -in the BEAM-0.85 design, L3 is deleted entirely. A reworked CR specialist -with explicit FACT A / FACT B framing will arrive in Phase 2.2." -``` - -## Task 0.4: Author the Phase 0 validation env file - -**Files:** -- Create: `.env.phase0-l1patched` - -- [ ] **Step 1: Write the env file** - -Create `/Users/moralespanitz/me/supernet/atomicmemory-core/.claude/worktrees/tbc-prototype/.env.phase0-l1patched`: - -```bash -POSTGRES_PORT=5502 -APP_PORT=3102 -DATABASE_URL=postgresql://atomicmemory:atomicmemory@localhost:5502/atomicmemory -LLM_PROVIDER=anthropic -LLM_API_URL= -LLM_API_KEY= -LLM_MODEL=claude-haiku-4-5 -EMBEDDING_PROVIDER=transformers -EMBEDDING_MODEL=Xenova/all-MiniLM-L6-v2 -EMBEDDING_DIMENSIONS=384 -ANTHROPIC_API_KEY= -ATOMICMEMORY_API_URL=http://localhost:3102 -COST_CAP_DAILY=200 -COST_CAP_ITER=20 -# Kept stack (h3-timeline) -TBC_ENABLED=true -TOPIC_ABSTRACTION_ENABLED=false -TOPIC_SEARCH_ENABLED=false -RERANKER_ENABLED=true -RECAP_LAYER_ENABLED=false -RECAP_SEARCH_ENABLED=false -HIERARCHICAL_RETRIEVAL_ENABLED=false -CHUNKED_EXTRACTION_ENABLED=true -CHUNKED_EXTRACTION_FALLBACK_ENABLED=true -TIMELINE_CHANNEL_ENABLED=true -PACKAGING_USE_OBSERVED_AT=true -# Layer 1 patched + on. Layer 3 deleted (no flag needed). -ANSWER_FORMAT_ALIGNMENT_ENABLED=true -``` - -- [ ] **Step 2: DO NOT commit this file** - -`.env.*` files are blocked by `.gitignore` (everything except `.env.example`). -Real API keys live in these files. The file exists on disk for the docker -stack to read; it never enters git. - -If you `git add` it accidentally, the .gitignore will block it. If you -force-add with `-f`, you have done a security violation — revert before push. - -## Task 0.5: Run the Phase 0 4-conv n=80 validation - -**Files:** -- Will create: `benchmarks-sprint3/results/haiku080/phase0-l1patched/summary.json` - -- [ ] **Step 1: Verify ports 3102 and 5502 are free** - -```bash -for p in 3102 5502; do - if lsof -nP -iTCP:$p -sTCP:LISTEN >/dev/null 2>&1; then echo "$p BUSY"; else echo "$p free"; fi -done -``` - -Expected: both free. If busy, identify and kill the holding process. - -- [ ] **Step 2: Verify LiteLLM is up (still required by the existing runner)** - -```bash -curl -sfm 2 http://localhost:4000/health/liveliness -``` - -Expected: `"I'm alive!"`. If down, start LiteLLM before proceeding. - -- [ ] **Step 3: Run all 4 conversations sequentially** - -```bash -/Users/moralespanitz/me/supernet/atomicmemory-research/memory-research/benchmarks-sprint3/tools/run_parallel_cell.sh \ - phase0-l1patched .env.phase0-l1patched 3102 5502 am-phase0-l1patched \ - 1,2,3,4 anthropic-haiku-4-5 anthropic-haiku-4-5 -``` - -Expected wall time: ~25 min. Expected cost: ~$2. - -- [ ] **Step 4: Inspect the summary** - -```bash -cat /Users/moralespanitz/me/supernet/atomicmemory-research/memory-research/benchmarks-sprint3/results/haiku080/phase0-l1patched/summary.json -``` - -Record the composite and per-ability scores. Compare against `h3-timeline/summary.json` (the previous baseline). 
- -- [ ] **Step 5: Run the per-question diff** - -```bash -python3 << 'PYEOF' -import json -prev = json.load(open('/Users/moralespanitz/me/supernet/atomicmemory-research/memory-research/benchmarks-sprint3/results/haiku080/h3-timeline/summary.json')) -new = json.load(open('/Users/moralespanitz/me/supernet/atomicmemory-research/memory-research/benchmarks-sprint3/results/haiku080/phase0-l1patched/summary.json')) -print(f"composite: {prev['composite']:.3f} -> {new['composite']:.3f} (delta {new['composite']-prev['composite']:+.3f})") -print("per-ability:") -for ab in sorted(prev['per_ability']): - p, n = prev['per_ability'][ab], new['per_ability'][ab] - print(f" {ab:25s}: {p:.3f} -> {n:.3f} (delta {n-p:+.3f})") -PYEOF -``` - -- [ ] **Step 6: Apply the Phase 0 gate** - -Decision table: - -| Composite Δ | Worst per-ability Δ | Verdict | -|---|---|---| -| ≥ +0.05 | ≥ -0.10 | **PASS** — proceed to Phase 1 | -| ∈ [-0.03, +0.05] | ≥ -0.10 | **PLATEAU** — diagnose via per-question diff, document, then proceed to Phase 1 anyway (L1 is the only Phase 0 change and the patch is theory-correct; we don't expect a large lift, we expect "doesn't regress") | -| < -0.03 OR any ability < -0.10 | — | **REGRESS** — stop. Run `git revert HEAD~5..HEAD` to undo Phase 0. Open a diagnostic and consult the user. | - -- [ ] **Step 7: Write the Phase 0 diagnostic doc** - -Create `/Users/moralespanitz/me/supernet/atomicmemory-research/memory-research/benchmarks-sprint5/phase-0-diagnostic.md` with: -- Composite before / after / delta -- Per-ability table -- 1-paragraph diagnosis from the per-question diff -- Verdict (PASS / PLATEAU / REGRESS) -- Next step - -- [ ] **Step 8: Commit** - -```bash -cd /Users/moralespanitz/me/supernet/atomicmemory-research -git add memory-research/benchmarks-sprint3/results/haiku080/phase0-l1patched/ \ - memory-research/benchmarks-sprint5/phase-0-diagnostic.md -git commit -m "results: Phase 0 4-conv n=80 validation" -``` - -- [ ] **Step 9: Gate** - -If verdict is REGRESS, stop and consult the user. Otherwise proceed to Phase 1. - ---- - -# PHASE 1 — Reflect step (async Sonnet consolidation) - -Goal: add an async LLM consolidation step that runs at session boundaries, writes synthesized observations to a new `session_reflections` table, and surfaces them via a `## OBSERVATIONS` retrieval channel for SUM / KU / MSR / CR / PF / IE queries. - -## Task 1.1: Author the migration for `session_reflections` and `reflection_jobs` - -**Files:** -- Create: `src/db/migrations/20260512_session_reflections.sql` - -- [ ] **Step 1: Write the migration** - -```sql --- 20260512_session_reflections.sql --- Phase 1 of BEAM-0.85 plan: async Reflect step storage. 
--- --- Two tables: --- session_reflections: synthesized observations per (user_id, conversation_id), --- each citing evidence_memory_ids and embedded for retrieval --- reflection_jobs: Postgres-backed async work queue for the reflect worker - -CREATE TABLE IF NOT EXISTS session_reflections ( - id UUID PRIMARY KEY DEFAULT gen_random_uuid(), - user_id TEXT NOT NULL, - conversation_id TEXT NOT NULL, - observation TEXT NOT NULL, - observation_type TEXT NOT NULL CHECK (observation_type IN ( - 'entity_state', 'event_summary', 'preference', - 'contradiction', 'decision', 'numeric_value' - )), - evidence_memory_ids TEXT[] NOT NULL, - embedding vector(384), - created_at TIMESTAMPTZ NOT NULL DEFAULT now() -); - -CREATE INDEX IF NOT EXISTS ix_session_reflections_user_conv - ON session_reflections (user_id, conversation_id); - -CREATE INDEX IF NOT EXISTS ix_session_reflections_embedding - ON session_reflections USING hnsw (embedding vector_cosine_ops); - -CREATE TABLE IF NOT EXISTS reflection_jobs ( - id UUID PRIMARY KEY DEFAULT gen_random_uuid(), - user_id TEXT NOT NULL, - conversation_id TEXT NOT NULL, - status TEXT NOT NULL DEFAULT 'pending' - CHECK (status IN ('pending', 'in_progress', 'completed', 'failed')), - attempts INTEGER NOT NULL DEFAULT 0, - last_error TEXT, - created_at TIMESTAMPTZ NOT NULL DEFAULT now(), - last_tried_at TIMESTAMPTZ -); - -CREATE UNIQUE INDEX IF NOT EXISTS ix_reflection_jobs_pending_unique - ON reflection_jobs (user_id, conversation_id) - WHERE status IN ('pending', 'in_progress'); - -CREATE INDEX IF NOT EXISTS ix_reflection_jobs_status_created - ON reflection_jobs (status, created_at); -``` - -- [ ] **Step 2: Run migration on the test DB** - -```bash -npm run migrate:test -``` - -Expected: no errors; tables created. - -- [ ] **Step 3: Verify schema** - -```bash -dotenv -e .env.test -- psql "$DATABASE_URL" -c "\d session_reflections" -c "\d reflection_jobs" -``` - -Expected: both tables shown with the columns above. - -- [ ] **Step 4: Commit** - -```bash -git add src/db/migrations/20260512_session_reflections.sql -git commit -m "migration: add session_reflections + reflection_jobs tables" -``` - -## Task 1.2: Implement `ReflectionsRepository` (CRUD + similarity search) - -**Files:** -- Create: `src/db/reflections-repository.ts` -- Create: `src/db/__tests__/reflections-repository.test.ts` - -- [ ] **Step 1: Write the failing integration test** - -```typescript -// src/db/__tests__/reflections-repository.test.ts -/** - * Integration tests for ReflectionsRepository. Uses the .env.test Postgres - * instance; assumes the 20260512_session_reflections migration has been - * applied (via `npm run migrate:test`). 
- */ -import { afterAll, beforeEach, describe, expect, it } from 'vitest'; -import pg from 'pg'; -import { ReflectionsRepository, type NewReflection } from '../reflections-repository.js'; -import { config } from '../../config.js'; - -const pool = new pg.Pool({ connectionString: config.databaseUrl }); -const repo = new ReflectionsRepository(pool); - -afterAll(async () => { await pool.end(); }); - -beforeEach(async () => { - await pool.query("DELETE FROM session_reflections WHERE user_id LIKE 'test-refl-%'"); -}); - -const USER = 'test-refl-1'; -const CONV = 'conv-A'; -const VEC = (n: number): number[] => Array.from({ length: 384 }, () => n); - -describe('ReflectionsRepository', () => { - it('inserts and reads back reflections by (userId, conversationId)', async () => { - const rows: NewReflection[] = [ - { userId: USER, conversationId: CONV, - observation: 'User uses Flask-Login v0.6.2', - observationType: 'entity_state', - evidenceMemoryIds: ['m1', 'm2'], - embedding: VEC(0.1) }, - ]; - await repo.insertMany(rows); - const found = await repo.findByConversation(USER, CONV); - expect(found).toHaveLength(1); - expect(found[0].observation).toBe('User uses Flask-Login v0.6.2'); - expect(found[0].observationType).toBe('entity_state'); - expect(found[0].evidenceMemoryIds).toEqual(['m1', 'm2']); - }); - - it('findSimilar returns the most cosine-similar reflections first', async () => { - await repo.insertMany([ - { userId: USER, conversationId: CONV, - observation: 'similar', observationType: 'event_summary', - evidenceMemoryIds: ['m1'], embedding: VEC(0.1) }, - { userId: USER, conversationId: CONV, - observation: 'far', observationType: 'event_summary', - evidenceMemoryIds: ['m2'], embedding: VEC(-0.9) }, - ]); - const hits = await repo.findSimilar(USER, VEC(0.1), 2); - expect(hits[0].observation).toBe('similar'); - expect(hits[1].observation).toBe('far'); - }); - - it('returns empty array when no reflections exist', async () => { - const hits = await repo.findSimilar(USER, VEC(0.5), 5); - expect(hits).toEqual([]); - }); -}); -``` - -- [ ] **Step 2: Run the test to verify it fails** - -```bash -dotenv -e .env.test -- npx vitest run src/db/__tests__/reflections-repository.test.ts -``` - -Expected: FAIL — module not found. - -- [ ] **Step 3: Implement the repository** - -```typescript -// src/db/reflections-repository.ts -/** - * ReflectionsRepository — CRUD plus cosine-similarity search for the - * session_reflections table. Each row is an LLM-synthesized observation about - * a conversation, with citations to the supporting memory ids and an embedding - * for retrieval-side similarity search. - * - * Pure SQL via pg.Pool. No ORM. Mutations fail closed: caller catches errors, - * we propagate them with the original error attached. 
- */ -import pg from 'pg'; - -export type ObservationType = - | 'entity_state' - | 'event_summary' - | 'preference' - | 'contradiction' - | 'decision' - | 'numeric_value'; - -export interface NewReflection { - userId: string; - conversationId: string; - observation: string; - observationType: ObservationType; - evidenceMemoryIds: string[]; - embedding: number[]; -} - -export interface Reflection extends NewReflection { - id: string; - createdAt: Date; -} - -function vectorLiteral(vec: number[]): string { - return `[${vec.join(',')}]`; -} - -export class ReflectionsRepository { - constructor(private readonly pool: pg.Pool) {} - - async insertMany(rows: readonly NewReflection[]): Promise { - if (rows.length === 0) return; - const sql = ` - INSERT INTO session_reflections - (user_id, conversation_id, observation, observation_type, evidence_memory_ids, embedding) - VALUES ($1, $2, $3, $4, $5, $6::vector) - `; - const client = await this.pool.connect(); - try { - await client.query('BEGIN'); - for (const r of rows) { - await client.query(sql, [ - r.userId, r.conversationId, r.observation, r.observationType, - r.evidenceMemoryIds, vectorLiteral(r.embedding), - ]); - } - await client.query('COMMIT'); - } catch (e) { - await client.query('ROLLBACK'); - throw e; - } finally { - client.release(); - } - } - - async findByConversation(userId: string, conversationId: string): Promise { - const { rows } = await this.pool.query( - `SELECT id, user_id, conversation_id, observation, observation_type, - evidence_memory_ids, created_at - FROM session_reflections - WHERE user_id = $1 AND conversation_id = $2 - ORDER BY created_at ASC`, - [userId, conversationId], - ); - return rows.map(mapRow); - } - - async findSimilar(userId: string, queryEmbedding: number[], topK: number): Promise { - const { rows } = await this.pool.query( - `SELECT id, user_id, conversation_id, observation, observation_type, - evidence_memory_ids, created_at - FROM session_reflections - WHERE user_id = $1 - ORDER BY embedding <=> $2::vector - LIMIT $3`, - [userId, vectorLiteral(queryEmbedding), topK], - ); - return rows.map(mapRow); - } -} - -function mapRow(r: pg.QueryResultRow): Reflection { - return { - id: r.id, - userId: r.user_id, - conversationId: r.conversation_id, - observation: r.observation, - observationType: r.observation_type, - evidenceMemoryIds: r.evidence_memory_ids, - embedding: [], - createdAt: r.created_at, - }; -} -``` - -- [ ] **Step 4: Run tests to verify they pass** - -```bash -dotenv -e .env.test -- npx vitest run src/db/__tests__/reflections-repository.test.ts -``` - -Expected: 3/3 PASS. - -- [ ] **Step 5: tsc clean** - -```bash -npx tsc --noEmit -``` - -Expected: no errors. 
- -- [ ] **Step 6: Commit** - -```bash -git add src/db/reflections-repository.ts src/db/__tests__/reflections-repository.test.ts -git commit -m "feat(reflections): add ReflectionsRepository with insertMany, findByConversation, findSimilar" -``` - -## Task 1.3: Implement `ReflectionJobsRepository` (Postgres-backed queue) - -**Files:** -- Create: `src/db/reflection-jobs-repository.ts` -- Create: `src/db/__tests__/reflection-jobs-repository.test.ts` - -- [ ] **Step 1: Write the failing test** - -```typescript -// src/db/__tests__/reflection-jobs-repository.test.ts -import { afterAll, beforeEach, describe, expect, it } from 'vitest'; -import pg from 'pg'; -import { ReflectionJobsRepository } from '../reflection-jobs-repository.js'; -import { config } from '../../config.js'; - -const pool = new pg.Pool({ connectionString: config.databaseUrl }); -const repo = new ReflectionJobsRepository(pool); - -afterAll(async () => { await pool.end(); }); - -beforeEach(async () => { - await pool.query("DELETE FROM reflection_jobs WHERE user_id LIKE 'test-rjq-%'"); -}); - -const USER = 'test-rjq-1'; -const CONV = 'conv-A'; - -describe('ReflectionJobsRepository', () => { - it('enqueue creates a pending job', async () => { - await repo.enqueue(USER, CONV); - const ready = await repo.fetchPending(10); - expect(ready).toHaveLength(1); - expect(ready[0].status).toBe('pending'); - expect(ready[0].userId).toBe(USER); - }); - - it('enqueue is idempotent per (userId, conversationId) while pending or in_progress', async () => { - await repo.enqueue(USER, CONV); - await repo.enqueue(USER, CONV); - const ready = await repo.fetchPending(10); - expect(ready).toHaveLength(1); - }); - - it('markInProgress / markCompleted / markFailed flow', async () => { - await repo.enqueue(USER, CONV); - const [job] = await repo.fetchPending(10); - await repo.markInProgress(job.id); - let row = await repo.findById(job.id); - expect(row?.status).toBe('in_progress'); - await repo.markCompleted(job.id); - row = await repo.findById(job.id); - expect(row?.status).toBe('completed'); - - await repo.enqueue(USER, 'conv-B'); - const [other] = await repo.fetchPending(10); - await repo.markFailed(other.id, 'boom'); - row = await repo.findById(other.id); - expect(row?.status).toBe('failed'); - expect(row?.lastError).toBe('boom'); - }); - - it('after completion, enqueue for same (user, conv) creates a new job', async () => { - await repo.enqueue(USER, CONV); - const [j] = await repo.fetchPending(10); - await repo.markInProgress(j.id); - await repo.markCompleted(j.id); - await repo.enqueue(USER, CONV); - const again = await repo.fetchPending(10); - expect(again).toHaveLength(1); - expect(again[0].id).not.toBe(j.id); - }); -}); -``` - -- [ ] **Step 2: Verify fail** - -```bash -dotenv -e .env.test -- npx vitest run src/db/__tests__/reflection-jobs-repository.test.ts -``` - -Expected: FAIL — module not found. - -- [ ] **Step 3: Implement** - -```typescript -// src/db/reflection-jobs-repository.ts -/** - * Postgres-backed work queue for the async Reflect step. - * - * Idempotent enqueue: a unique partial index on (user_id, conversation_id) - * WHERE status IN ('pending','in_progress') guarantees one in-flight job per - * conversation at a time. Re-enqueue after completion creates a new job (the - * unique index excludes 'completed' and 'failed'). - * - * The worker (services/reflect-jobs.ts) drives the lifecycle: fetchPending → - * markInProgress → run reflect → markCompleted | markFailed. 
- */ -import pg from 'pg'; - -export type JobStatus = 'pending' | 'in_progress' | 'completed' | 'failed'; - -export interface ReflectionJob { - id: string; - userId: string; - conversationId: string; - status: JobStatus; - attempts: number; - lastError: string | null; - createdAt: Date; - lastTriedAt: Date | null; -} - -export class ReflectionJobsRepository { - constructor(private readonly pool: pg.Pool) {} - - async enqueue(userId: string, conversationId: string): Promise { - await this.pool.query( - `INSERT INTO reflection_jobs (user_id, conversation_id) VALUES ($1, $2) - ON CONFLICT DO NOTHING`, - [userId, conversationId], - ); - } - - async fetchPending(limit: number): Promise { - const { rows } = await this.pool.query( - `SELECT id, user_id, conversation_id, status, attempts, last_error, - created_at, last_tried_at - FROM reflection_jobs - WHERE status = 'pending' - ORDER BY created_at ASC - LIMIT $1`, - [limit], - ); - return rows.map(mapJob); - } - - async markInProgress(id: string): Promise { - await this.pool.query( - `UPDATE reflection_jobs - SET status = 'in_progress', attempts = attempts + 1, last_tried_at = now() - WHERE id = $1`, - [id], - ); - } - - async markCompleted(id: string): Promise { - await this.pool.query( - `UPDATE reflection_jobs SET status = 'completed' WHERE id = $1`, - [id], - ); - } - - async markFailed(id: string, error: string): Promise { - await this.pool.query( - `UPDATE reflection_jobs SET status = 'failed', last_error = $2 WHERE id = $1`, - [id, error], - ); - } - - async findById(id: string): Promise { - const { rows } = await this.pool.query( - `SELECT id, user_id, conversation_id, status, attempts, last_error, - created_at, last_tried_at - FROM reflection_jobs WHERE id = $1`, - [id], - ); - return rows[0] ? mapJob(rows[0]) : null; - } -} - -function mapJob(r: pg.QueryResultRow): ReflectionJob { - return { - id: r.id, - userId: r.user_id, - conversationId: r.conversation_id, - status: r.status, - attempts: r.attempts, - lastError: r.last_error, - createdAt: r.created_at, - lastTriedAt: r.last_tried_at, - }; -} -``` - -- [ ] **Step 4: Verify pass** - -```bash -dotenv -e .env.test -- npx vitest run src/db/__tests__/reflection-jobs-repository.test.ts -``` - -Expected: 4/4 PASS. 
- -- [ ] **Step 5: Commit** - -```bash -git add src/db/reflection-jobs-repository.ts src/db/__tests__/reflection-jobs-repository.test.ts -git commit -m "feat(reflect): add Postgres-backed reflection_jobs queue repository" -``` - -## Task 1.4: Author `reflect-prompts.ts` — Sonnet system prompt + tool-use schema - -**Files:** -- Create: `src/services/reflect-prompts.ts` -- Create: `src/services/__tests__/reflect-prompts.test.ts` - -- [ ] **Step 1: Write the failing unit test** - -```typescript -// src/services/__tests__/reflect-prompts.test.ts -import { describe, expect, it } from 'vitest'; -import { buildReflectMessages, REFLECT_TOOL_SCHEMA } from '../reflect-prompts.js'; - -describe('reflect-prompts', () => { - it('REFLECT_TOOL_SCHEMA defines record_observations with required fields', () => { - expect(REFLECT_TOOL_SCHEMA.name).toBe('record_observations'); - const props = REFLECT_TOOL_SCHEMA.input_schema.properties; - expect(props).toBeDefined(); - expect(props.observations).toBeDefined(); - expect(props.observations.type).toBe('array'); - const items = props.observations.items; - expect(items.required).toEqual(expect.arrayContaining(['text', 'type', 'evidence_memory_ids'])); - expect(items.properties.type.enum).toEqual(expect.arrayContaining([ - 'entity_state', 'event_summary', 'preference', - 'contradiction', 'decision', 'numeric_value', - ])); - }); - - it('buildReflectMessages includes each memory id and observation type list', () => { - const memories = [ - { id: 'm1', text: 'User uses Flask 2.3', observedAt: new Date('2026-03-01') }, - { id: 'm2', text: 'User never used Flask', observedAt: new Date('2026-03-15') }, - ]; - const { system, user } = buildReflectMessages(memories); - expect(system).toContain('observations'); - expect(user).toContain('m1'); - expect(user).toContain('m2'); - expect(user).toContain('User uses Flask 2.3'); - expect(user).toContain('User never used Flask'); - }); -}); -``` - -- [ ] **Step 2: Verify fail** - -```bash -npx vitest run src/services/__tests__/reflect-prompts.test.ts -``` - -Expected: FAIL — module not found. - -- [ ] **Step 3: Implement** - -```typescript -// src/services/reflect-prompts.ts -/** - * Prompt assembly + Anthropic tool-use schema for the async Reflect step. - * - * The Reflect call presents Sonnet with a chronologically-sorted list of the - * session's raw memories (each with its memory id and observed_at) and asks - * Sonnet to consolidate them into a small set of typed observations. Each - * observation MUST cite the memory_ids it draws from, so retrieval can verify - * evidence still exists when the observation is later read by the answer LLM. - * - * Tool-use guarantees structured output — Sonnet returns a JSON payload that - * matches REFLECT_TOOL_SCHEMA, eliminating the freeform-prose parsing failures - * we saw with the Sprint 3 verifier pass. 
- */ - -export interface ReflectMemoryInput { - id: string; - text: string; - observedAt: Date; -} - -export interface ReflectMessages { - system: string; - user: string; -} - -const SYSTEM_PROMPT = [ - 'You are consolidating a single conversation\'s raw memories into a small set of typed observations.', - 'Each observation must (a) be answerable from the cited evidence_memory_ids alone, (b) prefer concrete factual claims over narrative, (c) avoid restating the raw facts verbatim.', - '', - 'Observation types (use exactly one per observation):', - '- entity_state: the current value of an attribute on an entity, with the latest-known value', - '- event_summary: a discrete event or action that happened', - '- preference: a stated user preference, opinion, or choice', - '- contradiction: two facts in the session that disagree (include both sides)', - '- decision: a user decision made during the session', - '- numeric_value: a numeric fact (count, amount, duration, percentage)', - '', - 'Output 5–15 observations covering distinct claims. Call the record_observations tool.', -].join('\n'); - -export const REFLECT_TOOL_SCHEMA = { - name: 'record_observations', - description: 'Persist the consolidated observations for this conversation.', - input_schema: { - type: 'object', - properties: { - observations: { - type: 'array', - items: { - type: 'object', - required: ['text', 'type', 'evidence_memory_ids'], - properties: { - text: { type: 'string' }, - type: { - type: 'string', - enum: [ - 'entity_state', 'event_summary', 'preference', - 'contradiction', 'decision', 'numeric_value', - ], - }, - evidence_memory_ids: { - type: 'array', - items: { type: 'string' }, - }, - }, - }, - }, - }, - required: ['observations'], - }, -} as const; - -export function buildReflectMessages(memories: readonly ReflectMemoryInput[]): ReflectMessages { - const lines = memories.map( - m => `[${m.id}] (${m.observedAt.toISOString().slice(0, 10)}) ${m.text}`, - ); - const user = ['Memories from this conversation (chronological):', '', ...lines].join('\n'); - return { system: SYSTEM_PROMPT, user }; -} -``` - -- [ ] **Step 4: Verify pass** - -```bash -npx vitest run src/services/__tests__/reflect-prompts.test.ts -``` - -Expected: 2/2 PASS. 
- -- [ ] **Step 5: Commit** - -```bash -git add src/services/reflect-prompts.ts src/services/__tests__/reflect-prompts.test.ts -git commit -m "feat(reflect): add Sonnet system prompt + tool-use schema for record_observations" -``` - -## Task 1.5: Implement `reflect.ts` orchestrator - -**Files:** -- Create: `src/services/reflect.ts` -- Create: `src/services/__tests__/reflect.test.ts` - -- [ ] **Step 1: Write the failing test (mocked LLM + repo)** - -```typescript -// src/services/__tests__/reflect.test.ts -import { describe, expect, it, vi } from 'vitest'; -import { runReflectForConversation, type ReflectDeps } from '../reflect.js'; - -const memories = [ - { id: 'm1', text: 'first', observedAt: new Date('2026-03-01') }, - { id: 'm2', text: 'second', observedAt: new Date('2026-03-02') }, -]; - -const toolOutput = { - observations: [ - { text: 'O1', type: 'event_summary' as const, evidence_memory_ids: ['m1', 'm2'] }, - { text: 'O2', type: 'preference' as const, evidence_memory_ids: ['m1'] }, - ], -}; - -describe('runReflectForConversation', () => { - it('calls LLM with built messages, embeds each observation, persists with citations', async () => { - const insertMany = vi.fn().mockResolvedValue(undefined); - const llmTool = vi.fn().mockResolvedValue(toolOutput); - const embed = vi.fn().mockResolvedValue([0.1, 0.2]); - const fetchMemories = vi.fn().mockResolvedValue(memories); - const deps: ReflectDeps = { - fetchMemories, - llmCallTool: llmTool, - embed, - reflections: { insertMany } as any, - maxObservations: 15, - }; - const res = await runReflectForConversation(deps, 'u1', 'c1'); - expect(fetchMemories).toHaveBeenCalledWith('u1', 'c1'); - expect(llmTool).toHaveBeenCalledTimes(1); - expect(embed).toHaveBeenCalledTimes(2); - expect(insertMany).toHaveBeenCalledTimes(1); - const inserted = insertMany.mock.calls[0][0]; - expect(inserted).toHaveLength(2); - expect(inserted[0].observation).toBe('O1'); - expect(inserted[0].evidenceMemoryIds).toEqual(['m1', 'm2']); - expect(res.count).toBe(2); - }); - - it('returns count=0 when conversation has no memories', async () => { - const deps: ReflectDeps = { - fetchMemories: vi.fn().mockResolvedValue([]), - llmCallTool: vi.fn(), - embed: vi.fn(), - reflections: { insertMany: vi.fn() } as any, - maxObservations: 15, - }; - const res = await runReflectForConversation(deps, 'u1', 'c1'); - expect(res.count).toBe(0); - expect(deps.llmCallTool).not.toHaveBeenCalled(); - }); - - it('truncates observations to maxObservations', async () => { - const insertMany = vi.fn().mockResolvedValue(undefined); - const big = { observations: Array.from({ length: 20 }, (_, i) => ({ - text: `O${i}`, type: 'event_summary' as const, evidence_memory_ids: ['m1'], - })) }; - const deps: ReflectDeps = { - fetchMemories: vi.fn().mockResolvedValue(memories), - llmCallTool: vi.fn().mockResolvedValue(big), - embed: vi.fn().mockResolvedValue([0.1]), - reflections: { insertMany } as any, - maxObservations: 5, - }; - const res = await runReflectForConversation(deps, 'u1', 'c1'); - expect(res.count).toBe(5); - }); -}); -``` - -- [ ] **Step 2: Verify fail** - -```bash -npx vitest run src/services/__tests__/reflect.test.ts -``` - -Expected: FAIL — module not found. - -- [ ] **Step 3: Implement** - -```typescript -// src/services/reflect.ts -/** - * Reflect orchestrator. Pulls a conversation's memories, sends them to the - * answer-LLM tool-use endpoint with the record_observations schema, embeds - * each returned observation, and persists them to session_reflections. 
-
- * Pure dependency-injected — the worker (reflect-jobs) supplies real
- * implementations; tests supply mocks. No I/O of its own beyond what the
- * injected dependencies do.
- */
-import type {
-  ReflectionsRepository,
-  NewReflection,
-  ObservationType,
-} from '../db/reflections-repository.js';
-import {
-  buildReflectMessages,
-  REFLECT_TOOL_SCHEMA,
-  type ReflectMemoryInput,
-} from './reflect-prompts.js';
-
-export interface ReflectToolOutput {
-  observations: Array<{
-    text: string;
-    type: ObservationType;
-    evidence_memory_ids: string[];
-  }>;
-}
-
-export interface ReflectDeps {
-  fetchMemories: (userId: string, conversationId: string) => Promise<ReflectMemoryInput[]>;
-  llmCallTool: (system: string, user: string, toolSchema: typeof REFLECT_TOOL_SCHEMA)
-    => Promise<ReflectToolOutput>;
-  embed: (text: string) => Promise<number[]>;
-  reflections: Pick<ReflectionsRepository, 'insertMany'>;
-  maxObservations: number;
-}
-
-export interface ReflectResult {
-  count: number;
-}
-
-export async function runReflectForConversation(
-  deps: ReflectDeps,
-  userId: string,
-  conversationId: string,
-): Promise<ReflectResult> {
-  const memories = await deps.fetchMemories(userId, conversationId);
-  if (memories.length === 0) return { count: 0 };
-
-  const { system, user } = buildReflectMessages(memories);
-  const out = await deps.llmCallTool(system, user, REFLECT_TOOL_SCHEMA);
-
-  const truncated = out.observations.slice(0, deps.maxObservations);
-  const rows: NewReflection[] = [];
-  for (const o of truncated) {
-    const embedding = await deps.embed(o.text);
-    rows.push({
-      userId,
-      conversationId,
-      observation: o.text,
-      observationType: o.type,
-      evidenceMemoryIds: o.evidence_memory_ids,
-      embedding,
-    });
-  }
-
-  await deps.reflections.insertMany(rows);
-  return { count: rows.length };
-}
-```
-
-- [ ] **Step 4: Verify pass**
-
-```bash
-npx vitest run src/services/__tests__/reflect.test.ts
-```
-
-Expected: 3/3 PASS.
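-
-Optional hardening, not one of this plan's steps: the design spec calls for `evidence_memory_ids` validation (it filters at retrieval time; the same check can also run at write time). A minimal sketch of that guard, which could run between the LLM call and `insertMany` — the helper name is an assumption, nothing exports it today:
-
-```typescript
-import type { ReflectMemoryInput } from './reflect-prompts.js';
-import type { ReflectToolOutput } from './reflect.js';
-
-// Hypothetical guard: keep only observations whose every cited evidence id
-// matches a memory that was actually fetched for this conversation.
-export function filterObservationsWithValidEvidence(
-  out: ReflectToolOutput,
-  memories: readonly ReflectMemoryInput[],
-): ReflectToolOutput {
-  const known = new Set(memories.map(m => m.id));
-  const observations = out.observations.filter(
-    o => o.evidence_memory_ids.length > 0
-      && o.evidence_memory_ids.every(id => known.has(id)),
-  );
-  return { observations };
-}
-```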
- -- [ ] **Step 5: Commit** - -```bash -git add src/services/reflect.ts src/services/__tests__/reflect.test.ts -git commit -m "feat(reflect): add runReflectForConversation orchestrator with DI" -``` - -## Task 1.6: Implement `reflect-jobs.ts` worker - -**Files:** -- Create: `src/services/reflect-jobs.ts` -- Create: `src/services/__tests__/reflect-jobs.test.ts` - -- [ ] **Step 1: Write the failing test** - -```typescript -// src/services/__tests__/reflect-jobs.test.ts -import { describe, expect, it, vi } from 'vitest'; -import { processOnePendingJob, type JobsWorkerDeps } from '../reflect-jobs.js'; - -const baseDeps = (): JobsWorkerDeps => ({ - jobs: { - fetchPending: vi.fn(), - markInProgress: vi.fn().mockResolvedValue(undefined), - markCompleted: vi.fn().mockResolvedValue(undefined), - markFailed: vi.fn().mockResolvedValue(undefined), - } as any, - runReflect: vi.fn().mockResolvedValue({ count: 3 }), -}); - -describe('processOnePendingJob', () => { - it('returns false when no pending job available', async () => { - const deps = baseDeps(); - (deps.jobs.fetchPending as any).mockResolvedValue([]); - const did = await processOnePendingJob(deps); - expect(did).toBe(false); - expect(deps.runReflect).not.toHaveBeenCalled(); - }); - - it('marks in_progress, runs reflect, marks completed on success', async () => { - const deps = baseDeps(); - (deps.jobs.fetchPending as any).mockResolvedValue([ - { id: 'j1', userId: 'u', conversationId: 'c' }, - ]); - const did = await processOnePendingJob(deps); - expect(did).toBe(true); - expect(deps.jobs.markInProgress).toHaveBeenCalledWith('j1'); - expect(deps.runReflect).toHaveBeenCalledWith('u', 'c'); - expect(deps.jobs.markCompleted).toHaveBeenCalledWith('j1'); - expect(deps.jobs.markFailed).not.toHaveBeenCalled(); - }); - - it('marks failed when runReflect throws', async () => { - const deps = baseDeps(); - (deps.jobs.fetchPending as any).mockResolvedValue([ - { id: 'j2', userId: 'u', conversationId: 'c' }, - ]); - (deps.runReflect as any).mockRejectedValue(new Error('boom')); - const did = await processOnePendingJob(deps); - expect(did).toBe(true); - expect(deps.jobs.markFailed).toHaveBeenCalledWith('j2', expect.stringContaining('boom')); - expect(deps.jobs.markCompleted).not.toHaveBeenCalled(); - }); -}); -``` - -- [ ] **Step 2: Verify fail** - -```bash -npx vitest run src/services/__tests__/reflect-jobs.test.ts -``` - -Expected: FAIL — module not found. - -- [ ] **Step 3: Implement** - -```typescript -// src/services/reflect-jobs.ts -/** - * Reflect worker. Pulls one pending job at a time from reflection_jobs, - * marks it in_progress, invokes the Reflect orchestrator, and records the - * outcome on the job row. - * - * Mutations fail closed: if Reflect throws, the job is marked failed with - * the error message — the loop continues with the next job. The worker never - * silently swallows errors. - * - * Designed for single-instance deployment; multi-instance leasing is out of - * scope for v1 (the unique partial index on (user_id, conversation_id) WHERE - * status IN ('pending','in_progress') keeps work bounded if accidentally - * double-deployed, but doesn't prevent two workers picking different jobs). 
-
- */
-import type { ReflectionJobsRepository } from '../db/reflection-jobs-repository.js';
-import type { ReflectResult } from './reflect.js';
-
-export interface JobsWorkerDeps {
-  jobs: Pick<ReflectionJobsRepository, 'fetchPending' | 'markInProgress' | 'markCompleted' | 'markFailed'>;
-  runReflect: (userId: string, conversationId: string) => Promise<ReflectResult>;
-}
-
-export async function processOnePendingJob(deps: JobsWorkerDeps): Promise<boolean> {
-  const [job] = await deps.jobs.fetchPending(1);
-  if (!job) return false;
-  await deps.jobs.markInProgress(job.id);
-  try {
-    await deps.runReflect(job.userId, job.conversationId);
-    await deps.jobs.markCompleted(job.id);
-  } catch (e) {
-    const msg = e instanceof Error ? e.message : String(e);
-    await deps.jobs.markFailed(job.id, msg);
-  }
-  return true;
-}
-
-export interface WorkerHandle {
-  stop: () => void;
-}
-
-export function startReflectWorker(deps: JobsWorkerDeps, pollMs: number): WorkerHandle {
-  let stopped = false;
-  const tick = async (): Promise<void> => {
-    if (stopped) return;
-    try {
-      const didWork = await processOnePendingJob(deps);
-      if (!didWork) {
-        await new Promise(r => setTimeout(r, pollMs));
-      }
-    } catch (e) {
-      // Worker-level errors (DB conn drop, etc.) — log to stderr and back off.
-      console.error('[reflect-worker] unexpected error:', e);
-      await new Promise(r => setTimeout(r, pollMs * 2));
-    }
-    if (!stopped) void tick();
-  };
-  void tick();
-  return { stop: () => { stopped = true; } };
-}
-```
-
-- [ ] **Step 4: Verify pass**
-
-```bash
-npx vitest run src/services/__tests__/reflect-jobs.test.ts
-```
-
-Expected: 3/3 PASS.
-
-- [ ] **Step 5: Commit**
-
-```bash
-git add src/services/reflect-jobs.ts src/services/__tests__/reflect-jobs.test.ts
-git commit -m "feat(reflect): add reflect-jobs worker with fail-closed job lifecycle"
-```
-
-## Task 1.7: Wire enqueue into `memory-ingest.ts`
-
-**Files:**
-- Modify: `src/services/memory-ingest.ts`
-
-- [ ] **Step 1: Inspect current ingest entry point**
-
-```bash
-grep -n "export async function\|export function" src/services/memory-ingest.ts | head -10
-```
-
-Identify the function that wraps a complete ingest of one or more memories.
-
-- [ ] **Step 2: Add the enqueue call at the end of the ingest flow**
-
-Find the function that returns after AUDN commits. Inject an optional `reflectionJobs` dependency and a guarded enqueue call after the commit:
-
-```typescript
-// near the top of the file, add to deps interface
-import type { ReflectionJobsRepository } from '../db/reflection-jobs-repository.js';
-
-// in the deps shape:
-interface IngestDeps {
-  // ... existing ...
-  reflectionJobs?: ReflectionJobsRepository;
-  reflectEnabled: boolean;
-}
-
-// at the end of the main ingest function, after the existing commit:
-if (deps.reflectEnabled && deps.reflectionJobs) {
-  try {
-    await deps.reflectionJobs.enqueue(userId, conversationId);
-  } catch (e) {
-    // Enqueue failure must not block the ingest response.
-    console.error('[memory-ingest] reflection enqueue failed:', e);
-  }
-}
-```
-
-Adapt to actual deps-shape and conversation_id source in the existing file. If `conversationId` is not currently threaded into ingest, thread it from the route handler.
-
-- [ ] **Step 3: tsc clean**
-
-```bash
-npx tsc --noEmit
-```
-
-Expected: no errors.
-
-- [ ] **Step 4: Add a unit test that confirms enqueue is called**
-
-If `memory-ingest.ts` already has tests, extend them. Add one test (in the existing memory-ingest tests file) where `reflectEnabled=true` and `reflectionJobs.enqueue` is asserted called once with `(userId, conversationId)`.
Add a complement where `reflectEnabled=false` and `enqueue` is NOT called. - -- [ ] **Step 5: Run tests** - -```bash -npx vitest run src/services/__tests__/memory-ingest.test.ts -``` - -(If the test file name differs, find the right one with `grep -rl memory-ingest src/services/__tests__/`.) - -Expected: PASS including the two new assertions. - -- [ ] **Step 6: Commit** - -```bash -git add src/services/memory-ingest.ts src/services/__tests__/memory-ingest*.ts -git commit -m "feat(ingest): enqueue reflection job after AUDN commit when REFLECT_ENABLED" -``` - -## Task 1.8: Implement `reflect-retrieval.ts` for query-time fetch - -**Files:** -- Create: `src/services/reflect-retrieval.ts` -- Create: `src/services/__tests__/reflect-retrieval.test.ts` - -- [ ] **Step 1: Write the failing test** - -```typescript -// src/services/__tests__/reflect-retrieval.test.ts -import { describe, expect, it, vi } from 'vitest'; -import { fetchReflectionsForQuery, type ReflectRetrievalDeps } from '../reflect-retrieval.js'; -import { QuestionType } from '../answer-format.js'; - -const reflection = (text: string): any => ({ - id: 'r1', userId: 'u', conversationId: 'c', observation: text, - observationType: 'event_summary', evidenceMemoryIds: ['m1'], - embedding: [], createdAt: new Date(), -}); - -describe('fetchReflectionsForQuery', () => { - it('returns empty when reflect retrieval disabled', async () => { - const deps: ReflectRetrievalDeps = { - reflections: { findSimilar: vi.fn() } as any, - embed: vi.fn(), - topK: 5, - enabled: false, - }; - const out = await fetchReflectionsForQuery(deps, 'u', 'How many?', QuestionType.NUMERIC_COUNT); - expect(out).toEqual([]); - expect(deps.reflections.findSimilar).not.toHaveBeenCalled(); - }); - - it('returns empty when question type is OTHER', async () => { - const deps: ReflectRetrievalDeps = { - reflections: { findSimilar: vi.fn() } as any, - embed: vi.fn(), - topK: 5, - enabled: true, - }; - const out = await fetchReflectionsForQuery(deps, 'u', 'unrelated', QuestionType.OTHER); - expect(out).toEqual([]); - expect(deps.reflections.findSimilar).not.toHaveBeenCalled(); - }); - - it('embeds and fetches top-K when type is in the routed set', async () => { - const findSimilar = vi.fn().mockResolvedValue([reflection('R1'), reflection('R2')]); - const embed = vi.fn().mockResolvedValue([0.1, 0.2]); - const deps: ReflectRetrievalDeps = { - reflections: { findSimilar } as any, embed, topK: 5, enabled: true, - }; - const out = await fetchReflectionsForQuery(deps, 'u', 'Summary please.', QuestionType.SUMMARY); - expect(embed).toHaveBeenCalledWith('Summary please.'); - expect(findSimilar).toHaveBeenCalledWith('u', [0.1, 0.2], 5); - expect(out).toHaveLength(2); - }); -}); -``` - -- [ ] **Step 2: Verify fail** - -```bash -npx vitest run src/services/__tests__/reflect-retrieval.test.ts -``` - -Expected: FAIL. - -- [ ] **Step 3: Implement** - -```typescript -// src/services/reflect-retrieval.ts -/** - * Query-time reflection retrieval. When the question classifier returns one of - * the "synthesis-heavy" types (summary, contradiction, preference, ...), this - * module embeds the query and pulls top-K reflections by cosine similarity. - * The result is later emitted as a ## OBSERVATIONS prompt channel by - * retrieval-format.ts. - * - * Returns [] when disabled or when the question type is OTHER — the caller - * passes the empty array through and downstream packaging emits no - * observations block. 
-
- */
-import type { ReflectionsRepository, Reflection } from '../db/reflections-repository.js';
-import { QuestionType } from './answer-format.js';
-
-const ROUTED_TYPES: ReadonlySet<QuestionType> = new Set([
-  QuestionType.SUMMARY,
-  QuestionType.CONTRADICTION,
-  QuestionType.PREFERENCE,
-  QuestionType.NUMERIC_COUNT,
-  QuestionType.EXACT_DATE,
-  QuestionType.ORDERED_LIST,
-]);
-
-export interface ReflectRetrievalDeps {
-  reflections: Pick<ReflectionsRepository, 'findSimilar'>;
-  embed: (text: string) => Promise<number[]>;
-  topK: number;
-  enabled: boolean;
-}
-
-export async function fetchReflectionsForQuery(
-  deps: ReflectRetrievalDeps,
-  userId: string,
-  query: string,
-  questionType: QuestionType,
-): Promise<Reflection[]> {
-  if (!deps.enabled) return [];
-  if (!ROUTED_TYPES.has(questionType)) return [];
-  const embedding = await deps.embed(query);
-  return deps.reflections.findSimilar(userId, embedding, deps.topK);
-}
-```
-
-- [ ] **Step 4: Verify pass**
-
-```bash
-npx vitest run src/services/__tests__/reflect-retrieval.test.ts
-```
-
-Expected: 3/3 PASS.
-
-- [ ] **Step 5: Commit**
-
-```bash
-git add src/services/reflect-retrieval.ts src/services/__tests__/reflect-retrieval.test.ts
-git commit -m "feat(reflect): add query-time reflection retrieval gated by question type"
-```
-
-## Task 1.9: Wire reflection retrieval into `search-pipeline.ts`
-
-**Files:**
-- Modify: `src/services/search-pipeline.ts`
-
-- [ ] **Step 1: Identify where the final retrieval result is assembled**
-
-```bash
-grep -n "applyExpansionAndReranking\|surfaced\|selected\|topK\|return" src/services/search-pipeline.ts | head -30
-```
-
-Identify the function/section where the final selected set is passed downstream.
-
-- [ ] **Step 2: Thread reflections into the pipeline output**
-
-Add a `reflections` field to the pipeline output type (or to the deps that flow to retrieval-format). Call `fetchReflectionsForQuery` with the query, classified type, and configured top-K. Attach the array (possibly empty) to the result; one possible output-type shape is sketched after Step 4.
-
-Insertion shape (illustrative):
-
-```typescript
-import { fetchReflectionsForQuery } from './reflect-retrieval.js';
-import { classifyQuestion } from './answer-format.js';
-
-// in the orchestration function, after `selected = await applyExpansionAndReranking(...)`:
-const reflections = await fetchReflectionsForQuery(
-  {
-    reflections: deps.stores.reflections!,
-    embed: deps.embed,
-    topK: deps.config.reflectRetrievalTopK,
-    enabled: deps.config.reflectEnabled,
-  },
-  userId,
-  query,
-  classifyQuestion(query),
-);
-// pass `reflections` downstream alongside `selected`
-```
-
-- [ ] **Step 3: tsc clean**
-
-```bash
-npx tsc --noEmit
-```
-
-- [ ] **Step 4: Smoke test the pipeline still compiles end-to-end**
-
-```bash
-npm run build 2>&1 | tail -5
-```
-
-Expected: build succeeds.
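-
-For Step 2, one possible shape for the output-type change (the interface name here is a placeholder — extend whatever the pipeline actually exports):
-
-```typescript
-import type { Reflection } from '../db/reflections-repository.js';
-
-// Placeholder name — use the pipeline's real output interface.
-export interface SearchPipelineResult {
-  // ... existing fields (selected memories, scores, telemetry) ...
-  reflections: readonly Reflection[]; // empty when Reflect is off or the type isn't routed
-}
-```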
- -- [ ] **Step 5: Commit** - -```bash -git add src/services/search-pipeline.ts -git commit -m "feat(search-pipeline): wire reflection retrieval branch into orchestration" -``` - -## Task 1.10: Emit `## OBSERVATIONS` channel in `retrieval-format.ts` - -**Files:** -- Modify: `src/services/retrieval-format.ts` -- Modify: `src/services/__tests__/retrieval-format.test.ts` - -- [ ] **Step 1: Write the failing test** - -In `src/services/__tests__/retrieval-format.test.ts`, add a new describe block: - -```typescript -import type { Reflection } from '../../db/reflections-repository.js'; - -const sampleReflection = (text: string, type: any = 'event_summary'): Reflection => ({ - id: 'r1', userId: 'u', conversationId: 'c', - observation: text, observationType: type, - evidenceMemoryIds: ['m1', 'm2'], - embedding: [], createdAt: new Date(), -}); - -describe('buildInjection with reflections', () => { - it('emits ## OBSERVATIONS section when reflections array is non-empty', () => { - const out = buildInjection({ - // existing required args — adapt to the existing helper signature - memories: [], - reflections: [sampleReflection('Observation 1')], - // ... other args as currently required - } as any); - expect(out).toContain('## OBSERVATIONS'); - expect(out).toContain('Observation 1'); - expect(out).toContain('event_summary'); - }); - - it('omits the OBSERVATIONS section when reflections array is empty', () => { - const out = buildInjection({ - memories: [], - reflections: [], - } as any); - expect(out).not.toContain('## OBSERVATIONS'); - }); -}); -``` - -- [ ] **Step 2: Verify fail** - -```bash -npx vitest run src/services/__tests__/retrieval-format.test.ts -t "buildInjection with reflections" -``` - -Expected: FAIL. - -- [ ] **Step 3: Implement the channel emission** - -In `src/services/retrieval-format.ts`, extend the `buildInjection` argument type to accept `reflections?: readonly Reflection[]`. Add a small helper: - -```typescript -import type { Reflection } from '../db/reflections-repository.js'; - -function buildObservationsChannel(reflections: readonly Reflection[] | undefined): string { - if (!reflections || reflections.length === 0) return ''; - const lines = reflections.map(r => { - const evidence = r.evidenceMemoryIds.join(', '); - return `- [${r.observationType}] ${r.observation}\n evidence: ${evidence}`; - }); - return `## OBSERVATIONS\n${lines.join('\n')}\n\n`; -} -``` - -In the main assembly path of `buildInjection`, prepend the result of `buildObservationsChannel(args.reflections)` to the existing injection text (BEFORE `## TIMELINE` or after, your choice — but consistent). - -- [ ] **Step 4: Verify pass** - -```bash -npx vitest run src/services/__tests__/retrieval-format.test.ts -``` - -Expected: all tests PASS (existing + new 2). - -- [ ] **Step 5: Commit** - -```bash -git add src/services/retrieval-format.ts src/services/__tests__/retrieval-format.test.ts -git commit -m "feat(retrieval-format): emit ## OBSERVATIONS channel for reflections" -``` - -## Task 1.11: Add config flags + flush route - -**Files:** -- Modify: `src/config.ts` -- Create: `src/routes/reflect.ts` - -- [ ] **Step 1: Add the config keys** - -In `src/config.ts`, inside the `RuntimeConfig` interface, add: - -```typescript - reflectEnabled: boolean; - reflectModel: string; - reflectMaxObservations: number; - reflectJobPollMs: number; - reflectDebounceMs: number; - reflectRetrievalTopK: number; -``` - -In the config constructor function, parse them: - -```typescript - reflectEnabled: (optionalEnv('REFLECT_ENABLED') ?? 
'false') === 'true', - reflectModel: optionalEnv('REFLECT_MODEL') ?? 'claude-sonnet-4-5', - reflectMaxObservations: parseInt(optionalEnv('REFLECT_MAX_OBSERVATIONS') ?? '12', 10), - reflectJobPollMs: parseInt(optionalEnv('REFLECT_JOB_POLL_MS') ?? '5000', 10), - reflectDebounceMs: parseInt(optionalEnv('REFLECT_DEBOUNCE_MS') ?? '60000', 10), - reflectRetrievalTopK: parseInt(optionalEnv('REFLECT_RETRIEVAL_TOP_K') ?? '5', 10), -``` - -Add to `INTERNAL_POLICY_CONFIG_FIELDS` if other flags follow that pattern. - -In `src/app/runtime-container.ts`, mirror the fields in `CoreRuntimeConfig`. - -- [ ] **Step 2: Implement the flush route** - -```typescript -// src/routes/reflect.ts -/** - * Synchronous reflect-flush endpoint for benchmark / eval mode. - * Processes all pending reflection_jobs serially and returns the count - * of jobs processed. Returns 503 if Reflect is disabled. - */ -import type { Request, Response } from 'express'; -import type { JobsWorkerDeps } from '../services/reflect-jobs.js'; -import { processOnePendingJob } from '../services/reflect-jobs.js'; - -export function makeReflectFlushHandler( - deps: JobsWorkerDeps, - enabled: boolean, -): (req: Request, res: Response) => Promise { - return async (_req, res) => { - if (!enabled) { - res.status(503).json({ error: 'reflect_disabled' }); - return; - } - let processed = 0; - let cap = 1000; - while (cap-- > 0) { - const did = await processOnePendingJob(deps); - if (!did) break; - processed++; - } - res.json({ processed }); - }; -} -``` - -Mount it in `server.ts` or wherever routes are registered: `POST /v1/reflect/flush`. - -- [ ] **Step 3: tsc clean** - -```bash -npx tsc --noEmit -``` - -- [ ] **Step 4: Commit** - -```bash -git add src/config.ts src/app/runtime-container.ts src/routes/reflect.ts src/server.ts -git commit -m "feat(reflect): config flags + POST /v1/reflect/flush sync endpoint" -``` - -## Task 1.12: Wire dependencies in `runtime-container.ts` and start the worker - -**Files:** -- Modify: `src/app/runtime-container.ts` -- Modify: `src/db/stores.ts` - -- [ ] **Step 1: Add the two repos to the stores bundle** - -In `src/db/stores.ts`, extend `CoreStores`: - -```typescript -import { ReflectionsRepository } from './reflections-repository.js'; -import { ReflectionJobsRepository } from './reflection-jobs-repository.js'; - -export interface CoreStores { - // ... existing fields ... - reflections: ReflectionsRepository | null; - reflectionJobs: ReflectionJobsRepository | null; -} -``` - -- [ ] **Step 2: Instantiate them in `createCoreRuntime`** - -In `src/app/runtime-container.ts`, after the other repository instantiations: - -```typescript -import { ReflectionsRepository } from '../db/reflections-repository.js'; -import { ReflectionJobsRepository } from '../db/reflection-jobs-repository.js'; -import { runReflectForConversation } from '../services/reflect.js'; -import { startReflectWorker } from '../services/reflect-jobs.js'; -import { embed as embedQuery } from '../services/embedding.js'; // adapt to actual API - -const reflections = runtimeConfig.reflectEnabled ? new ReflectionsRepository(pool) : null; -const reflectionJobs = runtimeConfig.reflectEnabled ? new ReflectionJobsRepository(pool) : null; - -// ... after building stores ... 
-
-stores.reflections = reflections;
-stores.reflectionJobs = reflectionJobs;
-
-if (runtimeConfig.reflectEnabled && reflections && reflectionJobs) {
-  const reflectModel = runtimeConfig.reflectModel;
-  const workerDeps = {
-    jobs: reflectionJobs,
-    runReflect: (userId: string, conversationId: string) =>
-      runReflectForConversation(
-        {
-          fetchMemories: async (u, c) => {
-            const rows = await memory.findByConversation(u, c);
-            return rows.map(r => ({ id: r.id, text: r.text, observedAt: r.observedAt }));
-          },
-          llmCallTool: (system, user, toolSchema) =>
-            callAnthropicTool(reflectModel, system, user, toolSchema),
-          embed: embedQuery,
-          reflections,
-          maxObservations: runtimeConfig.reflectMaxObservations,
-        },
-        userId,
-        conversationId,
-      ),
-  };
-  startReflectWorker(workerDeps, runtimeConfig.reflectJobPollMs);
-}
-```
-
-### Sub-step 2a — Add `findByConversation` to `MemoryRepository` if missing
-
-In `src/db/memory-repository.ts`, add (skip if it already exists):
-
-```typescript
-async findByConversation(
-  userId: string,
-  conversationId: string,
-): Promise<Array<{ id: string; text: string; observedAt: Date }>> {
-  const { rows } = await this.pool.query(
-    `SELECT id, content as text, observed_at
-     FROM memories
-     WHERE user_id = $1 AND conversation_id = $2
-     ORDER BY observed_at ASC`,
-    [userId, conversationId],
-  );
-  return rows.map(r => ({ id: r.id, text: r.text, observedAt: r.observed_at }));
-}
-```
-
-(Adapt the column name `content`/`text` and `conversation_id` to whatever the existing memories schema uses — `\d memories` to confirm.)
-
-### Sub-step 2b — Add `callAnthropicTool` to `services/llm.ts` if missing
-
-```typescript
-import Anthropic from '@anthropic-ai/sdk';
-import { config } from '../config.js';
-
-const client = new Anthropic({ apiKey: config.anthropicApiKey });
-
-interface AnthropicToolSchema {
-  name: string;
-  description: string;
-  input_schema: Record<string, unknown>;
-}
-
-export async function callAnthropicTool<T>(
-  model: string,
-  system: string,
-  user: string,
-  toolSchema: AnthropicToolSchema,
-): Promise<T> {
-  const response = await client.messages.create({
-    model,
-    max_tokens: 4096,
-    system,
-    messages: [{ role: 'user', content: user }],
-    tools: [toolSchema],
-    tool_choice: { type: 'tool', name: toolSchema.name },
-  });
-  for (const block of response.content) {
-    if (block.type === 'tool_use' && block.name === toolSchema.name) {
-      return block.input as T;
-    }
-  }
-  throw new Error(`Anthropic tool-use returned no ${toolSchema.name} block`);
-}
-```
-
-(Adapt `apiKey` to whatever shape the existing config exposes — it's likely `config.llmApiKey` based on prior env-files.)
-
-- [ ] **Step 3: tsc clean**
-
-```bash
-npx tsc --noEmit
-```
-
-- [ ] **Step 4: Run all tests**
-
-```bash
-npm test 2>&1 | tail -20
-```
-
-Expected: PASS.
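-
-One thing the wiring above leaves open is shutdown: `startReflectWorker` returns a `WorkerHandle` that is currently discarded. If the runtime container has a teardown path, a thin wrapper like this keeps the worker stoppable (the `registerShutdownHook` parameter is an assumed shape, not an existing API):
-
-```typescript
-import { startReflectWorker, type JobsWorkerDeps, type WorkerHandle } from '../services/reflect-jobs.js';
-
-// Sketch: start the worker and register a stop callback with whatever
-// teardown mechanism the container already has.
-export function startReflectWorkerWithShutdown(
-  deps: JobsWorkerDeps,
-  pollMs: number,
-  registerShutdownHook: (fn: () => void) => void,
-): WorkerHandle {
-  const handle = startReflectWorker(deps, pollMs);
-  registerShutdownHook(() => handle.stop()); // current tick finishes; no further polls start
-  return handle;
-}
-```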
- -- [ ] **Step 5: Commit** - -```bash -git add -A -git commit -m "feat(reflect): wire repos + worker in runtime-container, expose via stores" -``` - -## Task 1.13: Author the Phase 1 validation env file - -**Files:** -- Create: `.env.phase1-reflect` - -- [ ] **Step 1: Write the env file** - -```bash -POSTGRES_PORT=5503 -APP_PORT=3103 -DATABASE_URL=postgresql://atomicmemory:atomicmemory@localhost:5503/atomicmemory -LLM_PROVIDER=anthropic -LLM_API_URL= -LLM_API_KEY= -LLM_MODEL=claude-haiku-4-5 -EMBEDDING_PROVIDER=transformers -EMBEDDING_MODEL=Xenova/all-MiniLM-L6-v2 -EMBEDDING_DIMENSIONS=384 -ANTHROPIC_API_KEY= -ATOMICMEMORY_API_URL=http://localhost:3103 -COST_CAP_DAILY=200 -COST_CAP_ITER=20 -# Kept stack (h3-timeline) + L1-patched -TBC_ENABLED=true -TOPIC_ABSTRACTION_ENABLED=false -TOPIC_SEARCH_ENABLED=false -RERANKER_ENABLED=true -RECAP_LAYER_ENABLED=false -RECAP_SEARCH_ENABLED=false -HIERARCHICAL_RETRIEVAL_ENABLED=false -CHUNKED_EXTRACTION_ENABLED=true -CHUNKED_EXTRACTION_FALLBACK_ENABLED=true -TIMELINE_CHANNEL_ENABLED=true -PACKAGING_USE_OBSERVED_AT=true -ANSWER_FORMAT_ALIGNMENT_ENABLED=true -# Phase 1 — Reflect ON -REFLECT_ENABLED=true -REFLECT_MODEL=claude-sonnet-4-5 -REFLECT_MAX_OBSERVATIONS=12 -REFLECT_JOB_POLL_MS=5000 -REFLECT_DEBOUNCE_MS=10000 -REFLECT_RETRIEVAL_TOP_K=5 -``` - -- [ ] **Step 2: DO NOT commit this file** - -`.env.*` files are blocked by `.gitignore` (real API keys). File exists on -disk for docker; it never enters git. - -## Task 1.14: Run Phase 1 4-conv n=80 validation - -- [ ] **Step 1: Verify ports 3103 and 5503 are free** - -```bash -for p in 3103 5503; do - if lsof -nP -iTCP:$p -sTCP:LISTEN >/dev/null 2>&1; then echo "$p BUSY"; else echo "$p free"; fi -done -``` - -- [ ] **Step 2: Modify the runner to call `/v1/reflect/flush` between ingest and query phases** - -The current `run_parallel_cell.sh` does ingest and query inside one `omb run` invocation. We need a hook to call `POST http://localhost:$PORT/v1/reflect/flush` after the harness ingests the conversation but before it starts asking questions. Two options: - -(a) **Easiest:** add `BEAM_POST_INGEST_HOOK_URL` to the omb harness env (modify `agent-memory-benchmark/src/memory_bench/run.py`) and have it `requests.post` after ingest. - -(b) **No-harness-change:** patch the AM `/v1/memories` endpoint to also call `processOnePendingJob` synchronously when an env flag is on (only used in eval mode). Acceptable hack for the validation. - -Pick (b) for speed. Add to `routes/memories.ts` (the ingest handler): if `runtimeConfig.reflectEnabled` AND the post-ingest hook flag is set, await one job-drain cycle before responding. Document the flag clearly: `REFLECT_SYNC_DRAIN_ON_INGEST=true` (eval-mode only). - -- [ ] **Step 3: Run the validation** - -```bash -/Users/moralespanitz/me/supernet/atomicmemory-research/memory-research/benchmarks-sprint3/tools/run_parallel_cell.sh \ - phase1-reflect .env.phase1-reflect 3103 5503 am-phase1-reflect \ - 1,2,3,4 anthropic-haiku-4-5 anthropic-haiku-4-5 -``` - -Expected wall time: ~35 min (Reflect adds ~10 min for the Sonnet calls). Expected cost: ~$4 (Haiku $2 + Sonnet $2). 
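-
-For the Step 2(b) hook above, a minimal sketch of the eval-mode drain (assumptions: the ingest route can reach the same `JobsWorkerDeps` the worker uses, and `REFLECT_SYNC_DRAIN_ON_INGEST` is surfaced through the config layer rather than read from `process.env` directly):
-
-```typescript
-import { processOnePendingJob, type JobsWorkerDeps } from '../services/reflect-jobs.js';
-
-// Drain pending reflection jobs serially so reflections exist before the
-// harness starts asking questions. Eval-mode only; capped to stay bounded.
-export async function drainReflectionJobs(deps: JobsWorkerDeps, maxJobs = 100): Promise<number> {
-  let processed = 0;
-  while (processed < maxJobs && await processOnePendingJob(deps)) {
-    processed++;
-  }
-  return processed;
-}
-
-// In routes/memories.ts, after the ingest work and before the response
-// (config field name is an assumption for the env flag above):
-//   if (config.reflectEnabled && config.reflectSyncDrainOnIngest) {
-//     await drainReflectionJobs(reflectWorkerDeps);
-//   }
-```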
- -- [ ] **Step 4: Per-question diff vs Phase 0 baseline** - -```bash -python3 << 'PYEOF' -import json -prev = json.load(open('/Users/moralespanitz/me/supernet/atomicmemory-research/memory-research/benchmarks-sprint3/results/haiku080/phase0-l1patched/summary.json')) -new = json.load(open('/Users/moralespanitz/me/supernet/atomicmemory-research/memory-research/benchmarks-sprint3/results/haiku080/phase1-reflect/summary.json')) -print(f"composite: {prev['composite']:.3f} -> {new['composite']:.3f} (delta {new['composite']-prev['composite']:+.3f})") -print("per-ability:") -for ab in sorted(prev['per_ability']): - p, n = prev['per_ability'][ab], new['per_ability'][ab] - print(f" {ab:25s}: {p:.3f} -> {n:.3f} (delta {n-p:+.3f})") -PYEOF -``` - -- [ ] **Step 5: Apply Phase 1 strict gate** - -| Composite Δ | Worst per-ability Δ | Verdict | -|---|---|---| -| ≥ +0.05 | ≥ -0.10 | **PASS** — Phase 1 ships, proceed to Phase 2 brainstorm | -| Composite plateau, but SUM/KU/CR/MSR each ≥ +0.05 | ≥ -0.10 | **PASS** (per-ability win) — proceed to Phase 2 | -| ∈ [-0.03, +0.05] otherwise | ≥ -0.10 | **PLATEAU** — diagnose, document, decide keep-flagged-off vs continue-on | -| < -0.03 OR any ability < -0.10 | — | **REGRESS** — revert Phase 1 commits, root-cause, consult user | - -- [ ] **Step 6: Multirun if borderline** - -If composite Δ is within ±0.03 of the +0.05 threshold, run 3 seeds × n=80 to kill noise: - -```bash -for seed in 1 2 3; do - /Users/moralespanitz/me/supernet/atomicmemory-research/memory-research/benchmarks-sprint3/tools/run_parallel_cell.sh \ - phase1-reflect-s${seed} .env.phase1-reflect 3103 5503 am-phase1-reflect-s${seed} \ - 1,2,3,4 anthropic-haiku-4-5 anthropic-haiku-4-5 -done -``` - -Compute the 9-cell mean and the bootstrap 95% CI. Decide based on the CI lower bound. - -- [ ] **Step 7: Write the Phase 1 diagnostic doc** - -Create `/Users/moralespanitz/me/supernet/atomicmemory-research/memory-research/benchmarks-sprint5/phase-1-diagnostic.md`: -- Composite + per-ability before/after -- Sample of Reflect outputs (read 5 random rows from `session_reflections`) -- Per-question diff: which question types lifted, which regressed -- Verdict (PASS / PLATEAU / REGRESS) -- Next step - -- [ ] **Step 8: Commit** - -```bash -cd /Users/moralespanitz/me/supernet/atomicmemory-research -git add memory-research/benchmarks-sprint3/results/haiku080/phase1-reflect/ \ - memory-research/benchmarks-sprint5/phase-1-diagnostic.md -git commit -m "results: Phase 1 4-conv n=80 validation" -``` - -- [ ] **Step 9: Final gate** - -If PASS or PASS (per-ability win): announce ready for Phase 2 (specialists). The next implementation plan begins with another brainstorm → spec update → writing-plans cycle. - -If PLATEAU: diagnose. Common Reflect-specific failure modes to check first: -- Reflect not producing observations (check `session_reflections` row count) -- Observations not being retrieved (check `## OBSERVATIONS` appearing in query logs) -- Observations contradicting raw memories (per-question diff shows answer flipping wrong) -- Sonnet voice mismatch (try `REFLECT_MODEL=claude-haiku-4-5` and re-validate) - -If REGRESS: revert all Phase 1 commits (`git revert HEAD~N..HEAD` where N is the Phase 1 task count), open a diagnostic, consult user. 
- ---- - -## Plan-level pre-commit checklist (after Phase 1 lands) - -Before declaring Phase 1 done: - -- [ ] `npx tsc --noEmit` clean -- [ ] `npm test` all pass (no skipped suites) -- [ ] `fallow --no-cache` clean (or remaining items consciously deferred and documented) -- [ ] All new files ≤ 400 lines, all new functions ≤ 40 lines -- [ ] No `any`, no direct `process.env` reads -- [ ] Phase 0 + Phase 1 diagnostic docs committed in `benchmarks-sprint5/` -- [ ] Both `summary.json` files committed in `benchmarks-sprint3/results/haiku080/` - -## Out of scope (deferred to later plans) - -- Per-ability specialists (Phase 2 — separate plan) -- Hybrid model routing (Phase 3) -- TEMPR graph arm (Phase 4) -- Mental Models / Mission-Directives (Phase 5) -- Reflect storage TTL / compaction -- BEAM-1M / BEAM-10M tier validation -- Multi-instance worker leasing (single-instance only for v1) diff --git a/docs/superpowers/specs/2026-05-11-beam-085-anthropic-only-design.md b/docs/superpowers/specs/2026-05-11-beam-085-anthropic-only-design.md deleted file mode 100644 index 939883c..0000000 --- a/docs/superpowers/specs/2026-05-11-beam-085-anthropic-only-design.md +++ /dev/null @@ -1,244 +0,0 @@ -# BEAM 0.85+ via Anthropic-Only Hybrid Architecture — Design - -**Date:** 2026-05-11 -**Author:** AtomicStrata research (Claude + Moralespanitz) -**Status:** Spec — pending user approval before transition to writing-plans -**Target:** Composite ≥0.85 on BEAM-100K (stretch 0.90+) under Anthropic-judge, beating Hindsight (0.75 published, Gemini-judge) and Mem0 on the public leaderboard. - -## Goal - -Lift AtomicMemory's BEAM-100K composite from **0.411 (strict Haiku × Haiku-judge, kept stack)** to **0.85–0.92** without leaving Anthropic's model family for the pipeline. Quality is the primary objective; Pareto position is secondary but tracked as a constraint (target: stay below Hindsight's published $0.075/q cost). Reach the BEAM-1M and BEAM-10M tiers as a stretch goal. - -## Architecture - -Approach C — **shared spine + per-ability specialists for the bottom 3 abilities**. The current AM pipeline (RRF + reranker + TBC + timeline + packaging + Haiku answer) is preserved as the default "shared spine." A new async **Reflect step** consolidates session memories at ingest time. A new **question-type router** dispatches MSR / CR / KU+IE queries to specialist branches that bypass parts of the shared spine for those specific question types. All other questions take the shared spine unchanged. 
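-
-A minimal sketch of the deterministic router this paragraph describes (module path from the components table below; the regexes are the illustrative ones in the diagram, not a final pattern set):
-
-```typescript
-// services/specialists/question-router.ts — sketch only; patterns get tuned in Phase 2.
-export type SpecialistRoute = 'msr' | 'cr' | 'ku_ie' | 'shared';
-
-const MSR_PATTERN = /how many|total|across all/i;
-const CR_PATTERN = /have i ever|conflicting/i;
-const KU_IE_PATTERN = /what is the|when does/i;
-
-export function routeQuestion(query: string): SpecialistRoute {
-  if (MSR_PATTERN.test(query)) return 'msr';
-  if (CR_PATTERN.test(query)) return 'cr';
-  if (KU_IE_PATTERN.test(query)) return 'ku_ie';
-  return 'shared'; // fail-open: anything unrecognized takes the shared spine
-}
-```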
- -### Diagram - -``` -Ingest path - user turn → AUDN (Haiku) → memories table - └─→ literal-extractor (Haiku) → entity_values table - └─→ [async, session boundary] - Reflect (Sonnet) → session_reflections table - -Retrieve path - query → question-type classifier (deterministic regex) - ├─ shared spine (default): - │ RRF (sem + BM25 + temporal) → reranker → kept stack - │ packaging (+ ## TIMELINE + ## OBSERVATIONS if reflect retrieved) - │ → Haiku answer with L1 format-aligned prompt - ├─ MSR specialist (/how many|total|across all/): - │ retrieve → memory-aggregate (group by entity) → - │ Haiku with answer_with_count tool-use call - ├─ CR specialist (/have I ever|conflicting/): - │ retrieve → bilateral COUNTER fetch (both sides) → - │ Haiku with answer_contradiction tool-use call (FACT A / FACT B framing) - └─ KU/IE specialist (/what is the|when does/): - entity_values SQL lookup → hit: literal value - miss: fall through to shared spine -``` - -### Components (new modules, all ≤400 lines) - -| Module | Role | Phase | -|---|---|---| -| `services/reflect.ts` | Reflect orchestrator | 1 | -| `services/reflect-prompts.ts` | Sonnet system prompt + tool-use schema | 1 | -| `services/reflect-jobs.ts` | Postgres-backed async job queue | 1 | -| `services/reflect-retrieval.ts` | Query-time reflection fetch | 1 | -| `db/reflections-repository.ts` | CRUD for session_reflections | 1 | -| `db/reflection-jobs-repository.ts` | CRUD for reflection_jobs | 1 | -| `db/migrations/20260512_session_reflections.sql` | Schema | 1 | -| `services/specialists/question-router.ts` | Deterministic classifier + dispatch | 2 | -| `services/specialists/msr-specialist.ts` | MSR aggregator + count tool-use | 2.1 | -| `services/specialists/cr-specialist.ts` | CR bilateral framing + tool-use | 2.2 | -| `services/specialists/ku-ie-specialist.ts` | Literal SQL lookup | 2.3 | -| `services/specialists/specialist-types.ts` | Shared specialist types | 2 | -| `services/literal-extractor.ts` | Ingest-side literal-field extraction | 2.3 | -| `db/entity-values-repository.ts` | CRUD for entity_values | 2.3 | -| `db/migrations/20260513_entity_values.sql` | Schema | 2.3 | -| `services/model-router.ts` | Per-ability model selection (Phase 3) | 3 | -| `services/graph-arm.ts` | TEMPR 4th retrieval arm via belief_edges | 4 | - -### Modified files - -| File | Change | Phase | -|---|---|---| -| `services/answer-format.ts` | Patch ORDERED_LIST hint; tighten classifier (require numeric token) | 0 | -| `services/counter-edge-surface.ts` | **DELETE** — replaced by CR specialist | 0 | -| `config.ts` | New flags: `REFLECT_ENABLED`, `SPECIALIST_*_ENABLED`, `LITERAL_EXTRACTOR_ENABLED`, etc. | per phase | -| `services/memory-ingest.ts` | Call literal-extractor + write reflection_job after AUDN | 1, 2.3 | -| `services/memory-search.ts` | Dispatch via question-router; integrate reflection-retrieval | 1, 2 | -| `services/search-pipeline.ts` | Add reflection retrieval branch; add graph arm in Phase 4 | 1, 4 | -| `services/retrieval-format.ts` | Emit `## OBSERVATIONS` prompt channel | 1 | -| `app/runtime-container.ts` | Wire new repositories + job worker | 1+ | - -## Data flow - -### Ingest -1. User turn arrives via HTTP `POST /v1/memories`. -2. AUDN extracts atomic facts (Haiku call, current behavior preserved). -3. **NEW** (Phase 2.3): literal-extractor runs on each new fact, extracting `(entity, attribute, value, value_type, observed_at)` tuples into `entity_values`. -4. 
**NEW** (Phase 1): a `reflection_jobs` row is written for the affected `(user_id, conversation_id)`. Status = `pending`. Response returns to caller immediately. -5. **NEW** (Phase 1, async): worker polls `reflection_jobs WHERE status = 'pending' AND age > debounce_threshold`. For each ready job, fetches all memories for the conversation, calls Sonnet with the Reflect prompt + tool-use schema. Writes resulting observations to `session_reflections`. Marks job `completed`. - -### Retrieve -1. Query arrives via `POST /v1/memories/search`. -2. Question-type classifier (deterministic, no LLM) inspects query → returns one of `{msr, cr, ku_ie, shared}`. -3. Dispatch: - - `shared`: existing RRF + rerank + packaging + Haiku. PLUS — if classifier sub-flag `summary_or_preference_or_knowledge_update` matches, also fetch top-5 reflections via cosine similarity and emit `## OBSERVATIONS` channel in packaging. - - `msr`: shared retrieval + memory-aggregate post-process + Haiku tool-use call `answer_with_count`. - - `cr`: shared retrieval + bilateral COUNTER fetch (query `belief_edges` for both directions) + Haiku tool-use call `answer_contradiction`. - - `ku_ie`: query `entity_values` directly via SQL. On hit: return literal value in minimal answer template. On miss: fall through to shared spine. -4. Answer returned. Telemetry records which branch was taken. - -## Conditional cases / phase decision tree - -Sequential phases. Each phase gates the next. - -### Strict gate definition - -- **PASS:** composite Δ ≥ +0.05 vs prev-phase baseline AND no per-ability regression > 0.10 at 4-conv n=80. -- **PLATEAU:** composite Δ ∈ [−0.03, +0.05] with no ability < −0.10. **Diagnose via per-question diff** before deciding. May ship behind flag without claim. -- **REGRESS:** composite Δ < −0.03 OR any ability < −0.10. **Revert immediately**, root-cause, then optionally retry with modification. -- **Borderline:** for composite Δ within ±0.03 of the +0.05 threshold, run multirun (3 seeds × n=80) to kill single-run noise. - -### Phase tree - -``` -Phase 0 — L1-patched, L3-deleted, relock baseline - PASS → Phase 1 - PLATEAU → keep, no claim → Phase 1 - REGRESS → revert L1, investigate (ask user) - -Phase 1 — Reflect step (async, Sonnet) - PASS → Phase 2.1 - PLATEAU on composite, per-ability lifts on SUM/KU/CR/MSR ≥+0.05 each → keep ON → Phase 2.1 - PLATEAU else → flag OFF, diagnose, retry with modified prompt - REGRESS → OFF, diagnose: - - bad observations (evidence cite validity?) - - prompt-slot competition (verify ## OBSERVATIONS only fires when routed)? - - voice mismatch (try Haiku for Reflect)? - -Phase 2.1 — MSR specialist - PASS (MSR Δ ≥+0.15 AND composite Δ ≥+0.03) → Phase 2.2 - PLATEAU → ablate: classifier accuracy? aggregator grouping? - REGRESS → OFF, no retry without root cause - -Phase 2.2 — CR specialist - PASS (CR Δ ≥+0.15 AND composite Δ ≥+0.02) → Phase 2.3 - PLATEAU → ablate: COUNTER edges sparse? tool-use schema ambiguous? - REGRESS → OFF, investigate - -Phase 2.3 — KU/IE specialist + literal-extractor - PASS (KU OR IE Δ ≥+0.20) → Phase 3 - PLATEAU → ablate: entity_values population rate? (entity, attribute) extraction accuracy? 
- REGRESS → OFF, investigate - -Phase 3 — Hybrid model routing (Sonnet/Opus for hard abilities) - PASS (composite Δ ≥+0.05) → Phase 4 - PLATEAU → stay on Haiku (Sprint 3 documented: stronger LLM hurts strict-judge) - REGRESS → all-Haiku, document the model-vs-judge pattern - -Phase 4 — TEMPR 4th arm (graph retrieval) - PASS → Phase 5 - PLATEAU → OFF, document - REGRESS → OFF - -Phase 5 — Mental Models + Mission/Directives (polish) - PASS → DONE, run 4-conv multirun + BEAM-1M / BEAM-10M tiers - PLATEAU → DONE at current best - REGRESS → revert, claim previous best -``` - -### Cross-phase rules - -1. **Never stack two unvalidated changes.** Each phase ships in isolation, validated solo, then stacks. -2. **Never ship a mechanism that regresses any ability by >0.15** even if composite lifts. (L3 regressed CR by −0.188 while *targeting* CR — exact failure pattern this rule prevents.) -3. **Always keep previous-known-good as rollback.** Feature branch per phase, merge to `main` only after phase PASS. -4. **Always run per-question diff** before claiming PASS — composite numbers can hide mechanism-level damage (dedup looked +0.024 single-conv, was −0.024 at 4-conv). -5. **Rate-limit guard:** if Anthropic rate-limit pool exhausts during a phase, pause and reschedule. Do not parallelize harder. - -### Escalation ladder - -- **Phase regresses 3× with different mods** → architecture review: consider pivoting from Approach C → Approach B (full per-ability refactor) for the stuck ability, or scope down + skip. -- **Composite plateaus < 0.65 after Phase 3** → likely model-capacity bound. Add Phase 3.5: hybrid with Sonnet for ALL answer generation, validate. -- **Composite plateaus < 0.75 after Phase 4** → likely judge-calibration bound. Document the ceiling. Consider Sonnet-as-judge as an alternative anchor. - -## Error handling (per-mechanism) - -| Mechanism | Failure mode | Mitigation | -|---|---|---| -| Reflect | LLM call fails | Retry 3× exponential backoff. Then mark job `failed`, no reflections written. Query path unaffected. | -| Reflect | Hallucinated observation | Every observation cites `evidence_memory_ids`. Reflections with missing/invalid evidence filtered at retrieval. | -| Reflect | Contradicts raw memory | Raw memories always ground truth. Reflections are supplementary, never override. | -| Reflect | Storage growth | Out of scope for v1 (deferred to a separate compaction sprint). | -| MSR specialist | Aggregator returns 0 items | Fall through to shared spine. | -| MSR specialist | Tool-use call fails | Retry once, then fall through. | -| CR specialist | No COUNTER edges | Fall through. | -| CR specialist | Both sides missing | Fall through. | -| KU/IE specialist | entity_values miss | Fall through to shared spine (this will be common until table fills). | -| Question router | Classifier crash | Default to `shared` (fail-open routing is safe here). | -| Hybrid model router | Sonnet/Opus rate limit | Downgrade routed queries to Haiku for this request only; log + alert. | - -## Testing - -### Test types - -- **Unit (Vitest, in-tree):** every new module gets a `__tests__` file. Coverage: classifier regexes, aggregator grouping, Reflect prompt assembly, tool-use schema validation, repo CRUD. -- **Integration (Vitest + Postgres test DB):** Reflect end-to-end (ingest → job → worker → DB). Specialist dispatch (query → router → specialist → answer). Fall-through behavior (specialist miss → shared spine). -- **Smoke benchmark** (conv 2 n=20, ~5 min, $0.50): after each mechanism, before 4-conv. 
Catches catastrophic regressions. Never used to claim PASS. -- **Validation benchmark** (4-conv n=80, ~25 min, $2): the gate. Produces composite + per-ability summary.json. -- **Multirun (borderline)** (3 seeds × n=80, ~75 min, $6): when composite Δ within ±0.03 of +0.05 threshold. -- **Per-question diff** (Python over c{1,2,3,4}.json): after every validation run. Surfaces which questions moved and why. - -### Validation artifacts per phase - -Each phase produces: -- Branch: `feature/phase-N-{mechanism}` -- Results JSON: `benchmarks-sprint5/results/phase-N/summary.json` -- 1-page diagnostic: `benchmarks-sprint5/phase-N-diagnostic.md` (per-question diff + decision) -- Merge commit on `main` (only after PASS) OR revert commit (on REGRESS) - -## Risks - -1. **Reflect hallucinations contaminate retrieval.** → evidence_memory_ids validation + fail-closed. -2. **Specialist classifier misfires** (false-positive routing). → 100% fall-through on miss + verbose dispatch logging + per-question diff catches it. -3. **Stronger LLM hurts strict-judge composite** (documented Sprint 3 pattern). → Phase 3 has explicit "stay on Haiku" branch if regression. -4. **Approach C plateaus below 0.75.** → Phase 5 escalation ladder includes architecture review + Approach B pivot for stuck abilities. -5. **Anthropic rate limits during multirun.** → pause + reschedule, not parallelize harder. Budget per phase: max ~$15 in API calls. - -## Out of scope - -- LongMemEval-S, LoCoMo10, PersonaMem benchmarks (separate sprint after BEAM-100K target). -- Gemini-judge cross-calibration (Anthropic-judge only per user requirement). -- Working memory / scratchpad (Phase 5 stretch only; not required for 0.85 target). -- BGE Small EN v1.5 embedding swap (not LLM-dependent; can ship if Phase 4 plateaus). -- Reflect storage compaction / TTL (defer to later sprint). - -## Constraints (from CLAUDE.md) - -- TypeScript ESM -- Files ≤ 400 lines (excluding comments). New modules designed for this. -- Functions ≤ 40 lines (excluding catch/finally). -- No `any`. -- No `process.env` reads outside `src/config.ts`. -- Mutations fail closed. -- Pre-commit: `npx tsc --noEmit`, `npm test`, `fallow --no-cache`. - -## Success criteria - -- **Required:** BEAM-100K composite ≥ 0.75 at 4-conv n=80 under Haiku × Haiku-judge. -- **Target:** BEAM-100K composite ≥ 0.85. -- **Stretch:** BEAM-100K composite ≥ 0.90 AND BEAM-1M / BEAM-10M tiers measured. -- **Constraint:** Pareto cost ≤ Hindsight's $0.075/q across all measured tiers. -- **Quality bar:** every claimed PASS reproducible at 4-conv n=80 (multirun if borderline). - -## Scope of the first implementation plan - -This spec covers all 5 phases as the strategic umbrella. The **first implementation plan** (produced by `superpowers:writing-plans`) will cover only **Phase 0 + Phase 1** — foundation cleanup + Reflect step. After Phase 1 ships and gates, we re-enter brainstorming → spec-update → next plan for Phase 2 specialists. This keeps each implementation plan tractable and lets us re-plan based on Phase 1's actual measured results. - -## Next step - -Transition to `superpowers:writing-plans` to produce the Phase 0 + Phase 1 TDD implementation plan with one task per step. diff --git a/docs/tbc-phase-3-schema.md b/docs/tbc-phase-3-schema.md deleted file mode 100644 index d936281..0000000 --- a/docs/tbc-phase-3-schema.md +++ /dev/null @@ -1,214 +0,0 @@ -# TBC Phase 3 — Schema Migration Design - -**Date:** 2026-05-06 -**Branch:** `worktree-tbc-prototype` -**Status:** design (T1.1 deliverable). 
Implementation in T1.2. - ---- - -## Goal - -Phase 2 wrote belief state to `memories.metadata` (JSONB). Phase 3 promotes belief state to first-class columns + a new typed-edge table, so search-time consumers can read normalized fields without parsing JSONB. - -**Migration is strictly additive.** Pre-migration databases stay queryable; tbc-execution.ts dual-writes during the migration window. - ---- - -## Schema additions - -### 1. New columns on `memories` - -```sql --- Confidence in [0,1]; default 1.0 means "fully believed" (matches AUDN's no-confidence-tracking baseline). -ALTER TABLE memories ADD COLUMN IF NOT EXISTS confidence REAL DEFAULT 1.0 - CHECK (confidence >= 0.0 AND confidence <= 1.0); - --- Belief tier — controls how the claim influences answer generation. --- standard: default tier, normal weight in retrieval --- directive: promoted; injected as a "must follow" rule in answer prompt --- demoted: challenged; lower weight + flagged for re-evaluation --- retracted: believed false; excluded from default retrieval -ALTER TABLE memories ADD COLUMN IF NOT EXISTS belief_tier TEXT DEFAULT 'standard' - CHECK (belief_tier IN ('standard', 'directive', 'demoted', 'retracted')); - --- The TBC operator that most recently mutated this memory. -ALTER TABLE memories ADD COLUMN IF NOT EXISTS mutation_type TEXT DEFAULT NULL - CHECK (mutation_type IS NULL OR mutation_type IN ( - 'AFFIRM', 'UPDATE', 'RETRACT', 'SUPERSEDE', - 'PROMOTE', 'DEMOTE', 'EVIDENCE_FOR', 'COUNTER' - )); -``` - -### 2. New table `belief_edges` - -```sql -CREATE TABLE IF NOT EXISTS belief_edges ( - id UUID PRIMARY KEY DEFAULT gen_random_uuid(), - user_id TEXT NOT NULL, - source_id UUID NOT NULL, - target_id UUID NOT NULL, - edge_type TEXT NOT NULL CHECK (edge_type IN ( - 'evidence_for', -- source supports target's confidence - 'counter', -- source contradicts target's confidence - 'supersedes', -- source replaces target (more specific or general) - 'promotes', -- source promoted target to directive tier - 'demotes' -- source challenged target without retracting - )), - weight REAL NOT NULL DEFAULT 0.0 - CHECK (weight >= -1.0 AND weight <= 1.0), - rationale TEXT NOT NULL DEFAULT '', - created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), - workspace_id UUID DEFAULT NULL, - agent_id UUID DEFAULT NULL -); -``` - -### 3. 
Indexes - -```sql --- For "all evidence pointing at this claim" queries (queryable belief state) -CREATE INDEX IF NOT EXISTS idx_belief_edges_target - ON belief_edges (target_id, edge_type, created_at DESC); - --- For "all claims this evidence supports/counters" queries -CREATE INDEX IF NOT EXISTS idx_belief_edges_source - ON belief_edges (source_id, edge_type); - --- User-scoped target traversal (multi-tenant safety) -CREATE INDEX IF NOT EXISTS idx_belief_edges_user_target - ON belief_edges (user_id, target_id); - --- Tier-aware retrieval (directives surface fast) -CREATE INDEX IF NOT EXISTS idx_memories_belief_tier - ON memories (user_id, belief_tier) - WHERE deleted_at IS NULL AND expired_at IS NULL AND belief_tier != 'standard'; - --- Confidence-weighted retrieval (low-confidence demotion) -CREATE INDEX IF NOT EXISTS idx_memories_confidence - ON memories (user_id, confidence DESC) - WHERE deleted_at IS NULL AND expired_at IS NULL; -``` - ---- - -## Migration semantics - -| Property | Value | -|---|---| -| Additive only | yes — no DROP, no destructive change | -| Backfill | implicit via DEFAULT clauses; existing rows: `confidence=1.0`, `belief_tier='standard'`, `mutation_type=NULL` | -| Rollback | `ALTER TABLE memories DROP COLUMN ...` + `DROP TABLE belief_edges`; no data loss in pre-existing columns | -| Dual-write window | tbc-execution.ts writes both `metadata.confidence/mutation_type` AND new columns | -| Read path | search/repository can read either; prefer columns when populated, fall back to metadata | - ---- - -## How tbc-execution.ts changes - -Phase 2 wrote everything into `memories.metadata`. Phase 3 changes the executor to **dual-write** during the migration window: - -| Operator | Phase 2 (metadata-only) | Phase 3 (dual-write) | -|---|---|---| -| Affirm | metadata.confidence += delta | + `UPDATE memories SET confidence = confidence + delta` | -| Update | mutation_type=UPDATE in metadata | + `UPDATE memories SET mutation_type='UPDATE'` | -| Retract | tier=retracted in metadata | + `UPDATE memories SET belief_tier='retracted', mutation_type='RETRACT'` | -| Supersede | revision_history append | + `INSERT INTO belief_edges (..., edge_type='supersedes')` | -| Promote | tier=directive in metadata | + `UPDATE memories SET belief_tier='directive'` + edge insert | -| Demote | tier=demoted, conf-= in metadata | + `UPDATE memories SET belief_tier='demoted', confidence=...` + edge insert | -| EvidenceFor | revision_history append | + `INSERT INTO belief_edges (..., edge_type='evidence_for')` | -| Counter | revision_history append | + `INSERT INTO belief_edges (..., edge_type='counter')` | - -The Phase 2 metadata writes stay **as a fallback** for pre-migration databases. After Phase 3 lands and migration is verified, a cleanup commit removes the metadata writes. 
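-
-As a concrete illustration of one row in the table above, the Demote dual-write might look roughly like this (sketch only — the real executor deps live in tbc-execution.ts; column names follow the migration above, and the -0.5 edge weight is illustrative):
-
-```typescript
-import type { Pool } from 'pg';
-
-// Sketch of a Phase 3 dual-write for Demote: column update + typed edge in one
-// transaction. The Phase 2 metadata fallback write is omitted here for brevity.
-export async function executeDemoteDualWrite(
-  pool: Pool,
-  userId: string,
-  sourceId: string,      // the challenging claim
-  targetId: string,      // the claim being demoted
-  newConfidence: number,
-  rationale: string,
-): Promise<void> {
-  const client = await pool.connect();
-  try {
-    await client.query('BEGIN');
-    await client.query(
-      `UPDATE memories
-         SET belief_tier = 'demoted', confidence = $3, mutation_type = 'DEMOTE'
-       WHERE id = $2 AND user_id = $1`,
-      [userId, targetId, newConfidence],
-    );
-    await client.query(
-      `INSERT INTO belief_edges (user_id, source_id, target_id, edge_type, weight, rationale)
-       VALUES ($1, $2, $3, 'demotes', -0.5, $4)`,
-      [userId, sourceId, targetId, rationale],
-    );
-    await client.query('COMMIT');
-  } catch (e) {
-    await client.query('ROLLBACK');
-    throw e; // mutations fail closed
-  } finally {
-    client.release();
-  }
-}
-```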
-
-
----
-
-## New repository: `belief-edges-repository.ts`
-
-API:
-```ts
-export interface BeliefEdge {
-  id: string;
-  source_id: string;
-  target_id: string;
-  edge_type: 'evidence_for' | 'counter' | 'supersedes' | 'promotes' | 'demotes';
-  weight: number;
-  rationale: string;
-  created_at: Date;
-}
-
-export async function appendEdge(
-  userId: string,
-  source: string,
-  target: string,
-  edge_type: BeliefEdge['edge_type'],
-  weight: number,
-  rationale: string,
-): Promise<BeliefEdge>;
-
-export async function getEdgesForTarget(
-  userId: string,
-  target_id: string,
-): Promise<BeliefEdge[]>;
-
-export async function aggregateConfidenceDelta(
-  userId: string,
-  target_id: string,
-): Promise<number>; // sum of weights for evidence_for - sum of weights for counter
-```
-
-Aggregation function `aggregateConfidenceDelta` is the bridge to a future "queryable belief state" search operator: given a claim, fold all evidence/counter edges into a current-confidence reading.
-
----
-
-## Behavioral guarantees (regression)
-
-When `TBC_ENABLED=false` (default):
-- No new columns are read or written
-- AUDN code path is byte-for-byte unchanged
-- belief_edges table stays empty for that user
-- 62 regression tests still pass
-
-When `TBC_ENABLED=true`:
-- Existing memories' rows pre-migration: `confidence=1.0`, `belief_tier='standard'` — TBC reads default values, writes update them on next mutation
-- New ingest goes through tbc-execution dual-write
-- Search consumers can read either columns or metadata; prefer columns
-
----
-
-## Open questions for Phase 3 implementation
-
-1. **Confidence aggregation rule.** Two candidates fill the same claim slot, both get Affirm — does confidence update sum, max, or weighted average? Default proposal: weighted average by `weight` field, capped at 1.0.
-
-2. **Promote auto-eligibility.** Currently Promote is only LLM-triggered. Should the system auto-promote claims whose evidence-edge sum ≥ threshold (e.g., 3 EvidenceFor edges over time)? Defer to Phase 4.
-
-3. **Retract → directive tier interaction.** If a Promoted claim is later Retracted, what happens to dependent reasoning? Phase 3 just sets `belief_tier='retracted'`; downstream-edge invalidation deferred.
-
-4. **Belief_edges and workspace scoping.** Should an edge cross workspaces? Default: no; both source and target must be in the same workspace.
-
-5. **Pruning policy for belief_edges.** Without pruning, the edge table grows quadratically with conversation length. Phase 5+ adds a retention policy (e.g., compress edges older than N days into aggregate weights).
-
----
-
-## Migration execution plan (T1.2)
-
-1. Author migration as a separate file `src/db/migrations/2026-05-06-tbc-phase3.sql` (or similar — discover the project's migration convention)
-2. Apply to local dev DB; run existing test suite to confirm no regression
-3. Apply to test DB; run TBC unit tests against real schema
-4. Document the rollback SQL alongside the forward migration
-5. Update `tbc-execution.ts` to dual-write
-6. Add `belief-edges-repository.ts`
-7.
Smoke test: ingest 5 facts, mutate via TBC, query belief_edges and confirm rows - ---- - -## Phase 4+ (post-Phase 3) - -Phase 4 wires belief-state queries into search: -- New search request field: `recall_belief_state(attribute, as_of?)` → returns current believed value with provenance -- Used to attack BEAM-100K KU/CR/ABS abilities specifically - -Phase 5 brings belief-state into hierarchical retrieval (T2 line of work): -- Session-summary embeddings filter to "high-confidence + non-retracted" claims at retrieval time -- Belief tier influences answer prompt (directives go first) - -The Phase 3 migration is the foundation; Phases 4+5 are paper-shape contributions. diff --git a/docs/typed-belief-calculus.md b/docs/typed-belief-calculus.md deleted file mode 100644 index 4f64712..0000000 --- a/docs/typed-belief-calculus.md +++ /dev/null @@ -1,244 +0,0 @@ -# Typed Belief Calculus (TBC) — Design Document - -**Status:** Phase 2 prototype (uncommitted, behind `TBC_ENABLED=false` by default) -**Owner:** AtomicMemory core -**Source rationale:** `atomicmemory-research/memory-research/landscape/2026-05-06-typed-belief-calculus-thinking.md` - -## 1. Why TBC - -Today AUDN reconciles every inbound atomic claim against existing memories -and emits one of `Add | Update | Delete | No-op | Supersede | Clarify`. This -is already finer-grained than any peer system in the 19-system landscape, but -it still treats updates as discrete state changes. Beliefs in agent memory -are continuous: evidence accumulates, contradicts, qualifies, generalizes. - -The Typed Belief Calculus (TBC) extends AUDN's decision space to **eight typed -operators**, each with explicit storage semantics. AUDN remains a strict subset -— every existing AUDN action maps to a TBC operator, and the rollout is gated -by a single flag so the prototype can ride alongside production without risk. - -## 2. The eight operators - -### Affirm -New evidence **supports** an existing claim. No new canonical fact is created. -The target claim's confidence is incremented and an evidence pointer is -recorded against its current version. This is the TBC analog of `NOOP` when -the candidate is genuinely a duplicate, but with the explicit signal that the -duplicate carries informational weight. - -### Update -A claim about the same attribute now holds a **different value** (e.g., "lives -in Boston" → "lives in Seattle"). Versioned supersession: the old version is -retained as historic state, the new version becomes current, and the -revision history records the operator that drove the change. This is the -direct heir of AUDN `UPDATE`. - -### Retract -The claim is now believed **false** with no replacement. Mark `RETRACTED` -rather than deleting the row, and preserve the original as evidence so -future agents can see "this was once asserted and was withdrawn." This is -finer-grained than AUDN `DELETE`: deletion erases, retraction is a typed -non-belief. - -### Supersede -Replaced by a more **specific or general** claim ("uses a Python web -framework" → "uses FastAPI"). Old and new are linked, both queryable. Maps -1-to-1 with AUDN `SUPERSEDE` but TBC additionally records direction -(specialization vs. generalization) for downstream query rewriting. - -### Promote -A claim has been **strong and repeated** enough to become a **directive** — -a constraint that influences answer assembly, not just one fact among many. -Promotion moves the claim into a "directive" tier and bumps its prompt -priority. This is genuinely new: AUDN has no analog. 
Phase 2 will define the -threshold (count of Affirms, confidence floor) and whether promotion is -implicit or explicit (see open questions). - -### Demote -The claim has been **challenged but not retracted** — fresh evidence is -inconsistent enough to lower confidence and flag the belief for -re-evaluation, but not enough to retract. Confidence drops; a "needs -re-evaluation" tag attaches; the claim remains queryable. Adds visibility -to the soft-conflict regime AUDN currently routes to `CLARIFY`. - -### EvidenceFor -Adds a **supports** edge from the inbound claim to a target claim's current -version. Does not introduce a new canonical fact and does not change the -target's content — only the evidence graph. Distinct from `Affirm` in that -the inbound text is itself novel (it stays as its own node) but its semantic -weight contributes to a different claim's confidence. - -### Counter -Adds a **contradicts** edge from the inbound claim to a target claim's -current version. Like `EvidenceFor`, this is a graph-only operator — -neither claim is mutated; the edge records the tension. Aggregating edges -is what eventually drives `Demote` or `Retract`. - -## 3. Schema additions (Phase 3 plan) - -Phase 1 is non-schema. The following are projected for Phase 3. - -### New columns -- `memories.confidence` (`real`, default `1.0`) — current belief strength. -- `memories.belief_tier` (`text`, default `'standard'`, candidate values - `'standard' | 'directive'`) — tier promoted via `Promote`. -- `claim_versions.mutation_type` extends to include the eight TBC operators - (current set: `add | update | supersede | delete | clarify`). - -### New table — `belief_edges` -``` -id uuid primary key -user_id uuid not null -source_id uuid not null -- inbound claim (memory id) -target_id uuid not null -- supported/contradicted claim (memory id) -edge_type text not null -- 'evidence_for' | 'counter' -weight real not null -- in [0, 1]; aggregated into confidence_delta -rationale text -created_at timestamptz default now() -``` - -This is the substrate that turns memory from a fact list into a graph. -Search and retrieval read it through aggregation views; ingest writes one -row per `EvidenceFor` / `Counter` decision. - -### New table — `belief_revision_history` (optional, Phase 3) -Normalized form of the in-metadata `revision_history` array — useful when -a claim accumulates more than a handful of revisions. Phase 1/2 keep the -list inline in `MemoryMetadata` because revision counts will be small. - -## 4. AUDN integration - -The integration seam is intentionally narrow. When `TBC_ENABLED=false` -(default) nothing in the AUDN path changes; the new types exist but are -never read. When `TBC_ENABLED=true`: - -1. `resolveAndExecuteAudn` (in `src/services/memory-audn.ts`) checks - `deps.config.tbcEnabled`. -2. If true, it calls `decideBeliefOperator(newClaim, candidates)` from - `src/services/typed-belief-calculus.ts` instead of `cachedResolveAUDN`. -3. The resulting `BeliefOperationDecision` is translated to either: - - an existing AUDN executor (Affirm → NOOP+evidence, Update → UPDATE, - Retract → DELETE, Supersede → SUPERSEDE), or - - a new TBC-only executor (Promote, Demote, EvidenceFor, Counter) - landing in Phase 2 alongside the LLM resolver. -4. The trace shape (`IngestFactTrace`) gains an optional `beliefOperator` - field so existing traces remain valid; this lands in Phase 2 with the - first executor. 
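-
-A minimal sketch of that seam, with the dependency shape and argument
-types reduced to illustrative stand-ins (the real `memory-audn.ts`
-signatures are not reproduced here); only the single-flag branch is
-taken from the steps above:
-
-```ts
-// Simplified stand-in types; the real claim and decision shapes live in
-// memory-service-types.ts and typed-belief-calculus.ts.
-type Claim = { id: string; text: string };
-type Decision = {
-  operator: string;
-  targetClaimId?: string;
-  confidenceDelta: number;
-  rationale: string;
-};
-
-interface Deps {
-  config: { tbcEnabled: boolean };
-  cachedResolveAUDN: (claim: Claim, candidates: Claim[]) => Promise<Decision>;
-  decideBeliefOperator: (claim: Claim, candidates: Claim[]) => Promise<Decision>;
-}
-
-// Fast-path and deferred-AUDN short-circuits are assumed to have
-// returned before this point; the flag rewires only the LLM decision step.
-async function decideMutation(
-  deps: Deps,
-  claim: Claim,
-  candidates: Claim[],
-): Promise<Decision> {
-  return deps.config.tbcEnabled
-    ? deps.decideBeliefOperator(claim, candidates)
-    : deps.cachedResolveAUDN(claim, candidates);
-}
-```
-
-With the flag off, the ternary collapses to the existing AUDN call,
-which is what keeps current behavior (and its regression suite)
-untouched.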
- -Critically, **the fast-path AUDN and deferred-AUDN routes remain -unchanged.** They continue to short-circuit before TBC is consulted; the -LLM call is the only step we rewire. - -## 5. Migration path - -Phase 1 (this PR) is **strictly additive and gated off**: -- New file `src/services/typed-belief-calculus.ts` (types + stub resolver) -- One config flag (`tbcEnabled`, default false) -- One IngestRuntimeConfig field -- This design doc - -No DB migration. No data migration. No production code path branches on -`tbcEnabled` yet; the flag exists so Phase 2's wiring stays a one-line -change. - -Phase 2 (LLM resolver) and Phase 3 (schema additions) are tracked -separately; both will be additive — existing rows without `confidence` -default to `1.0`, existing memories without a `mutation_type` default to -the AUDN-era value, and the `belief_edges` table is unreferenced when -the flag is off. There is no rollback hazard because there is no -destructive migration. - -## 6. Open questions - -1. **Confidence aggregation.** When several `EvidenceFor` edges fire over - time, how does the target's `confidence` move? Linear sum capped at - 1.0? Bayesian update with a fixed prior? Beta posterior parameterized - by edge count? Phase 2 needs to fix this; Phase 1 leaves it - unspecified because no aggregation runs yet. -2. **Promote: implicit vs. explicit.** Should `Promote` fire automatically - when an Affirm count crosses a threshold, or only when the LLM - resolver explicitly chooses it given a candidate's history? Implicit - is simpler; explicit gives the LLM a knob to tune directive strength - per-domain. -3. **Counter without a known target.** What does the resolver do with - an inbound claim that contradicts something we don't have? AUDN - today treats it as `ADD`; TBC could record a "challenge in waiting" - so a future ingest of the matching claim is auto-demoted. -4. **Revision-history bound.** The `BeliefMetadata.revision_history` - array is unbounded by design (audit). Phase 2 may need a rotation - policy for long-lived directive claims. -5. **Demote and search ranking.** Once `confidence` lands as a column, - should retrieval scoring weight by it? Almost certainly yes for - directive-tier claims; less obviously for `standard`. Phase 4 (search) - territory. - -## 7. Phase status - -| Phase | Scope | Status | -|---|---|---| -| 1 | Type surface, config flag, design doc | Done | -| 2 | LLM resolver, executors for the four new operators, trace extension | **In progress (this PR)** | -| 3 | DB migration: confidence column, belief_edges table, mutation_type expansion | Not started | -| 4 | Search integration: confidence-weighted ranking, directive-tier injection | Not started | -| 5 | Benchmark validation: BEAM CR/KU/ABS lift under TBC vs. AUDN baseline | Not started | - -## 8. Phase 2 status — wired vs. deferred - -**Wired in this PR:** - -- `decideBeliefOperator(newClaim, candidates, llm?)` is now an LLM-backed - resolver in `src/services/typed-belief-calculus.ts`. It builds a TBC - prompt around the inbound claim plus up to 3 conflict candidates with - their current belief state and demands a JSON response with - `{operator, target_claim_id?, confidence_delta, rationale}`. JSON parse - failures, transport failures, invalid operators, and out-of-set target - IDs all raise the typed `BeliefResolverError` — there is no silent - fallback to ADD. 
-- `resolveAndExecuteAudn` (in `src/services/memory-audn.ts`) now branches - on `deps.config.tbcEnabled` after fast-audn / deferred-audn short-circuit - and delegates to `resolveAndExecuteTbc` in - `src/services/tbc-execution.ts`. With the flag off, the file-byte-diff - inside `resolveAndExecuteAudn` is a single guarded `if`; nothing under - the AUDN code path changes. -- `tbc-execution.ts` translates each of the eight operators: - - **Affirm** → existing AUDN `NOOP` (records evidence on the existing - claim version). - - **Update** → existing AUDN `UPDATE` executor. - - **Retract** → existing AUDN `DELETE` executor. - - **Supersede** → existing AUDN `SUPERSEDE` executor. - - **Promote** → in-memory metadata write of `mutation_type=PROMOTE` plus - `directive: true` and a bumped `confidence`, with a new - `revision_history` entry. - - **Demote** → in-memory metadata write of `mutation_type=DEMOTE` plus - a lowered `confidence`, with a new `revision_history` entry. - - **EvidenceFor** → graph-only edge appended to `belief_edges` with a - positive `weight` derived from `confidence_delta`. - - **Counter** → graph-only edge appended to `belief_edges` with a - negative `weight`. - All four new operators write into the existing JSONB `metadata` column — - no DB migration in this phase. -- `IngestFactTrace` (and its inner `IngestTraceDecision`) gain an optional - `beliefOperator?: BeliefOperator` field plus eight new - `tbc-*` reason codes and a `'tbc'` decision source. AUDN traces remain - unchanged when the flag is off. -- Unit tests live at - `src/services/__tests__/typed-belief-calculus.test.ts` and cover the - resolver (eight operators, confidence clamping, three fail-closed - paths), the executor (each TBC-only operator, the AUDN-mappable - routing, the confidence-math sequence), and the flag-off regression. - -**Deferred to Phase 3 / 4:** - -- A real `belief_edges` table — Phase 2 stores edges in metadata only. - When the table lands, the Phase-2 `belief_edges` metadata blob becomes - a write-through cache to seed the new schema. -- A `confidence` column on `memories` and a `belief_tier` enum. Today - confidence and tier live in metadata; query-time reads default to - `1.0` / `'standard'` until the column exists. -- Reading hydrated `BeliefMetadata` into the resolver's prompt. The Phase-2 - prompt currently reports `confidence: 1.0` and `mutation_type: NONE` - for every candidate; Phase 4 wires the loaded state through. -- Search-side use of TBC state — confidence-weighted scoring and - directive-tier injection are Phase 4 (search integration) territory. -- Aggregating accumulated `belief_edges` into a confidence drift signal. - Phase 2 records the edges; Phase 3/4 read them back. 
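-
-To close, a small sketch of the fail-closed contract described above.
-The names are illustrative (the real resolver lives in
-`typed-belief-calculus.ts` and its exact shapes are not reproduced);
-what is taken from the design is the contract itself: malformed JSON,
-unknown operators, and out-of-set target IDs raise
-`BeliefResolverError` rather than silently falling back to ADD.
-
-```ts
-const OPERATORS = [
-  'Affirm', 'Update', 'Retract', 'Supersede',
-  'Promote', 'Demote', 'EvidenceFor', 'Counter',
-] as const;
-type BeliefOperator = (typeof OPERATORS)[number];
-
-class BeliefResolverError extends Error {}
-
-interface BeliefOperationDecision {
-  operator: BeliefOperator;
-  target_claim_id?: string;
-  confidence_delta: number;
-  rationale: string;
-}
-
-// Validate one resolver response. Every failure path throws; nothing
-// silently degrades to an ADD decision.
-function parseResolverResponse(
-  raw: string,
-  candidateIds: Set<string>,
-): BeliefOperationDecision {
-  let parsed: unknown;
-  try {
-    parsed = JSON.parse(raw);
-  } catch {
-    throw new BeliefResolverError('resolver returned non-JSON output');
-  }
-  if (typeof parsed !== 'object' || parsed === null) {
-    throw new BeliefResolverError('resolver response is not an object');
-  }
-  const d = parsed as Partial<BeliefOperationDecision>;
-  if (!d.operator || !(OPERATORS as readonly string[]).includes(d.operator)) {
-    throw new BeliefResolverError(`invalid operator: ${String(d.operator)}`);
-  }
-  if (d.target_claim_id !== undefined && !candidateIds.has(d.target_claim_id)) {
-    throw new BeliefResolverError(
-      `target ${d.target_claim_id} is not one of the conflict candidates`,
-    );
-  }
-  if (typeof d.confidence_delta !== 'number' || Number.isNaN(d.confidence_delta)) {
-    throw new BeliefResolverError('missing or non-numeric confidence_delta');
-  }
-  // Clamp into [-1, 1]; the exact clamping rule is an assumption here.
-  const confidence_delta = Math.max(-1, Math.min(1, d.confidence_delta));
-  return {
-    operator: d.operator,
-    target_claim_id: d.target_claim_id,
-    confidence_delta,
-    rationale: d.rationale ?? '',
-  };
-}
-```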
diff --git a/package-lock.json b/package-lock.json index 9a340d3..73dedb7 100644 --- a/package-lock.json +++ b/package-lock.json @@ -1,12 +1,12 @@ { "name": "@atomicmemory/core", - "version": "1.0.0", + "version": "1.0.1", "lockfileVersion": 3, "requires": true, "packages": { "": { "name": "@atomicmemory/core", - "version": "1.0.0", + "version": "1.0.1", "license": "Apache-2.0", "dependencies": { "@anthropic-ai/claude-agent-sdk": "^0.2.140", diff --git a/package.json b/package.json index 0fb4058..b146aae 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "@atomicmemory/core", - "version": "1.0.0", + "version": "1.0.1", "description": "Open-source memory engine for AI applications — semantic retrieval, AUDN mutation, and contradiction-safe claim versioning.", "type": "module", "license": "Apache-2.0", diff --git a/src/config.ts b/src/config.ts index ca637bc..fdc6b56 100644 --- a/src/config.ts +++ b/src/config.ts @@ -191,7 +191,7 @@ export interface RuntimeConfig { * Typed Belief Calculus (TBC) gate. When true, the AUDN decision step * defers to `decideBeliefOperator` from `services/typed-belief-calculus.ts`. * Default false — Phase 1 ships only the type surface and stub resolver, - * so existing AUDN behavior is unchanged. See `docs/typed-belief-calculus.md`. + * so existing AUDN behavior is unchanged. */ tbcEnabled: boolean; /** @@ -199,7 +199,7 @@ export interface RuntimeConfig { * conversation/session summaries first, then expands to atomic facts within * the matched sessions. Targets BEAM-10M scale (~14M tokens of context per * system) where flat top-K retrieval loses signal. - * Default false. See `docs/hierarchical-retrieval.md`. + * Default false. * Env var: HIERARCHICAL_RETRIEVAL_ENABLED=true */ hierarchicalRetrievalEnabled: boolean; diff --git a/src/db/belief-edges-repository.ts b/src/db/belief-edges-repository.ts index c551f63..5f5a28c 100644 --- a/src/db/belief-edges-repository.ts +++ b/src/db/belief-edges-repository.ts @@ -4,7 +4,7 @@ * Promote / Demote operators of the typed belief calculus. * * Schema lives in src/db/schema.sql under "TBC Phase 3" section. - * Activated only when `TBC_ENABLED=true`; see docs/typed-belief-calculus.md. + * Activated only when `TBC_ENABLED=true`. */ import pg from 'pg'; diff --git a/src/db/schema.sql b/src/db/schema.sql index dadb242..029f549 100644 --- a/src/db/schema.sql +++ b/src/db/schema.sql @@ -520,7 +520,7 @@ CREATE INDEX IF NOT EXISTS idx_memory_foresight_workspace -- Promotes belief state from `memories.metadata` JSONB into typed columns + -- a new `belief_edges` table. All additions are idempotent (IF NOT EXISTS). -- Pre-migration databases stay queryable; tbc-execution.ts dual-writes --- during the migration window. Design doc: docs/tbc-phase-3-schema.md. +-- during the migration window. -- Activated only when TBC_ENABLED=true; defaults preserve existing behavior. -- --------------------------------------------------------------------------- @@ -593,7 +593,7 @@ CREATE INDEX IF NOT EXISTS idx_belief_edges_user_target -- BEAM-10M scale (10 conversations × ~1.4M tokens each = ~14M total context). -- session_summaries + conv_summaries indexed via HNSW on summary_embedding. -- Activated only when HIERARCHICAL_RETRIEVAL_ENABLED=true; defaults preserve --- existing flat-retrieval behavior. Design doc: docs/hierarchical-retrieval.md. +-- existing flat-retrieval behavior. 
-- --------------------------------------------------------------------------- CREATE TABLE IF NOT EXISTS session_summaries ( diff --git a/src/db/summaries-repository.ts b/src/db/summaries-repository.ts index d70fa16..2645cbc 100644 --- a/src/db/summaries-repository.ts +++ b/src/db/summaries-repository.ts @@ -1,8 +1,7 @@ /** * Repository for hierarchical-retrieval session + conversation summaries. * Schema lives in src/db/schema.sql under "Hierarchical Retrieval" section. - * Activated only when `HIERARCHICAL_RETRIEVAL_ENABLED=true`; see - * docs/hierarchical-retrieval.md. + * Activated only when `HIERARCHICAL_RETRIEVAL_ENABLED=true`. * * Reads use pgvector cosine distance (`embedding <=> $1`) returning * `1 - distance` as similarity. The `pgvector` package converts JS diff --git a/src/services/memory-service-types.ts b/src/services/memory-service-types.ts index 64ea18a..c867422 100644 --- a/src/services/memory-service-types.ts +++ b/src/services/memory-service-types.ts @@ -371,7 +371,6 @@ export interface IngestRuntimeConfig { * Hierarchical retrieval gate. When true, ingest generates session + * conversation summaries (session-summary-generator.ts); search adds a 5th * RRF arm over those summaries. Default false — no runtime effect today. - * See `docs/hierarchical-retrieval.md`. */ hierarchicalRetrievalEnabled: boolean; /** diff --git a/src/services/retrieval-policy.ts b/src/services/retrieval-policy.ts index 5489ee0..914b99c 100644 --- a/src/services/retrieval-policy.ts +++ b/src/services/retrieval-policy.ts @@ -75,7 +75,7 @@ const RECALL_BYPASS_REASONS = { * * Validated 2026-04-01: 0/15 false positives across 2,173 benchmark queries * (7 datasets). 4 borderline date-pinned queries are harmless (extra depth, - * no accuracy impact). See: docs/.../current-marker-fp-analysis-2026-04-01.md + * no accuracy impact). * * If editing this list, re-run the FP scan: * classifyQueryDetailed() against all eval dataset queries. diff --git a/src/services/typed-belief-calculus.ts b/src/services/typed-belief-calculus.ts index cdbd48d..ac0045b 100644 --- a/src/services/typed-belief-calculus.ts +++ b/src/services/typed-belief-calculus.ts @@ -9,8 +9,7 @@ * Phase 2 (this revision) wires `decideBeliefOperator` to a real LLM call * and lets `memory-audn.ts` route through it when `RuntimeConfig.tbcEnabled` * is true. Schema is unchanged — TBC mutations write to existing JSONB - * metadata only. See `tbc-execution.ts` for the executor and - * `docs/typed-belief-calculus.md` for design rationale. + * metadata only. See `tbc-execution.ts` for the executor. */ import type { ChatMessage, LLMProvider } from './llm.js';