KU-3, TR-2, PT-3: Child facts exist but outranked by high-activation organic entries #5

@Liorrr

Problem

After KS69-KS71 (consolidation redesign + Tier 2), 3 benchmark cases consistently fail in both embedding-only and consolidation modes. The correct child facts ARE extracted and stored, but they don't score high enough to rank in top-5 results.

All 3 share the same root cause: embedding similarity gap — the child's embedding is too distant from the query, and high-activation organic entries dominate.

Failing Cases

KU-3: "What IDE does Sam use?"

TR-2: "Where has Sam traveled recently?"

PT-3: "What language is Sam learning?"

  • Expected: Japanese/JLPT in top-3
  • Child exists: "I practiced my Japanese — I'm at JLPT N3 level" (subject: "Japanese")
  • Child rank: Not in top-5 at all
  • Blocking entries: programming-language memories (Rust, Go, Python) dominate because "language" is ambiguous between natural and programming languages
  • Gap: BGE-small-EN-v1.5 doesn't distinguish "natural language learning" from "programming language preference" well enough

What's been tried (KS68-KS71)

  • Label topic boost (+0.06 for topic:tools:editor) — helps in seeded benchmark but LLM doesn't produce this label
  • Supersession demotion (-0.15) — works for KU-1 (Shopify→Stripe) but gap is too small for KU-3
  • Self-contained proposition extraction (KS69 prompt v3) — facts are good quality, problem is ranking not extraction
  • Subject fix (KS71 P0) — subjects now "Neovim"/"Tokyo"/"Japanese" not "the user"
  • Quality gate (KS71 P1) — filters fragments, doesn't affect ranking
  • Soft invalidation (KS71 P3) — 0.5x demotion on superseded children, helps but not enough
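Taken together, the tweaks tried so far are small additive or multiplicative adjustments layered on top of the base embedding similarity. A minimal sketch of how they stack (hypothetical signature and flag names; the real pipeline lives in `echo.rs`):

```rust
/// Hypothetical combination of the KS68-KS71 scoring tweaks.
/// Base score is embedding similarity; adjustments stack on top.
fn adjusted_score(
    base_similarity: f32,
    topic_label_matches: bool,   // label topic boost fires (e.g. topic:tools:editor)
    is_superseded: bool,         // supersession demotion applies
    is_invalidated_child: bool,  // soft invalidation (KS71 P3)
) -> f32 {
    let mut score = base_similarity;
    if topic_label_matches {
        score += 0.06; // label topic boost
    }
    if is_superseded {
        score -= 0.15; // supersession demotion
    }
    if is_invalidated_child {
        score *= 0.5; // soft invalidation
    }
    score
}
```

This makes the failure mode concrete: when the similarity gap between the child and a blocking organic entry exceeds ~0.06, the label boost alone cannot close it.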

Potential fix directions (not yet implemented)

A. HyDE query expansion

Ask an LLM to generate a hypothetical answer before embedding the query. "What IDE does Sam use?" → "Sam uses Neovim as his editor" → embed that instead. The expanded query embeds much closer to the child fact. EchoConfig already has a hyde_enabled field stub.
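A rough sketch of the flow, with `generate` and `embed` as stand-ins for the LLM and embedding-model calls (neither is the real API; the flag mirrors the existing `hyde_enabled` stub in EchoConfig):

```rust
/// Sketch of HyDE query expansion: embed a hypothetical answer instead
/// of the raw query. `generate` and `embed` are illustrative stand-ins,
/// not the real client APIs.
fn hyde_query(
    query: &str,
    hyde_enabled: bool,
    generate: impl Fn(&str) -> String,
    embed: impl Fn(&str) -> Vec<f32>,
) -> Vec<f32> {
    if hyde_enabled {
        // e.g. "What IDE does Sam use?" -> "Sam uses Neovim as his editor"
        let prompt =
            format!("Write a one-sentence hypothetical answer to: {query}");
        embed(&generate(&prompt))
    } else {
        embed(query)
    }
}
```

The retrieval side is untouched: only the query vector changes, so this slots in ahead of the existing scoring pipeline at the cost of one extra LLM round trip per query.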

B. Importance boost for superseding children

When a child's parent supersedes another memory, boost the child's importance score. "Sam uses Neovim" (child of superseding parent) gets +0.1 importance when the VS Code→Neovim edge exists.
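A minimal sketch of direction B, assuming a hypothetical `Child` shape and a +0.1 boost clamped so importance stays in [0, 1]:

```rust
/// Hypothetical child record; field names are illustrative, not the
/// real consolidation types.
struct Child {
    importance: f32,
    parent_supersedes_other: bool, // e.g. the "VS Code -> Neovim" edge exists
}

/// Boost a child's importance when its parent supersedes another memory.
fn boosted_importance(child: &Child) -> f32 {
    if child.parent_supersedes_other {
        (child.importance + 0.1).min(1.0) // clamp to keep scores in range
    } else {
        child.importance
    }
}
```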

C. Stronger label classification for LLM-extracted children

Currently children inherit parent labels (Tier 1 keyword). Add Tier 2 label enrichment specifically for children — classify "Sam uses Neovim" with topic:tools:editor so label_topic_boost fires.
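A toy version of the child-side enrichment, with illustrative keyword rules standing in for a real Tier 2 LLM classification pass:

```rust
/// Sketch of Tier-2-style label enrichment for children: attach a topic
/// label to the child's text so label_topic_boost can fire. The keyword
/// rules here are illustrative only.
fn classify_child(text: &str) -> Option<&'static str> {
    let lower = text.to_lowercase();
    if ["neovim", "vs code", "editor"].iter().any(|&k| lower.contains(k)) {
        Some("topic:tools:editor")
    } else if lower.contains("jlpt") || lower.contains("learning") {
        Some("action:learning")
    } else {
        None
    }
}
```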

D. Embedding model upgrade

BGE-small-EN-v1.5 (384-dim) conflates "language" (natural vs programming). A larger model (e.g., BGE-base or E5-large) may separate these better. Trade-off: latency + memory.

E. Query-type disambiguation

Detect "learning" in query → boost action:learning labeled entries. Detect "IDE"/"editor" → boost topic:tools:editor. Already partially exists in label_topic_boost but needs the child to carry the right label.
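Sketched as a query-side rule table (keywords and label names are illustrative; the real mapping would extend the query classification in `labels.rs`):

```rust
/// Sketch of query-type disambiguation: map query keywords to the label
/// whose entries should be boosted. Naive substring matching for brevity.
fn query_boost_label(query: &str) -> Option<&'static str> {
    let q = query.to_lowercase();
    if q.contains("learning") || q.contains("studying") {
        Some("action:learning")
    } else if q.contains("ide") || q.contains("editor") {
        Some("topic:tools:editor")
    } else {
        None
    }
}
```

Note this only pays off in combination with direction C: the query-side boost needs the child to carry the matching label in the first place.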

Benchmark context

  • Embedding-only: 16-17/20 (KU-3, TR-2, PT-3 always fail)
  • Consolidation (qwen2.5:1.5b): 17/20 (same 3 fail)
  • Seeded (deterministic children): 20/20 (all pass with hand-crafted labels + confidence)
  • The 3-case gap between seeded and consolidation mode is entirely due to this issue

Files

  • crates/shrimpk-memory/src/echo.rs — scoring pipeline, label_topic_boost, importance_boost
  • crates/shrimpk-memory/src/labels.rs — query classification, Tier 1/2 labels
  • crates/shrimpk-memory/src/consolidation.rs — child creation, label inheritance
  • tests/echo_micro_benchmark.rs — benchmark definitions
