Entity unification: subjects, triples, supersession, labels, Hebbian all doing entity work independently #7

@Liorrr

Problem

ShrimPK has nine separate mechanisms that each approximate entity identification on their own, with no unifying layer:

  1. subject field on MemoryEntry (core/memory.rs:234) — extracted per child fact
  2. Entity labels (entity:* prefix in labels.rs) — keyword-based, not linked to entities
  3. Triples (subject, predicate, object) — stored per memory, disconnected from other entity refs
  4. fix_degenerate_subject() (consolidation.rs:700) — 100-line heuristic to repair "the user"/"I"/"me"
  5. child_topic_matches_query() (echo.rs:3008) — label/subject gate, fallback-based
  6. Subject diversity cap (echo.rs:3283) — per-(subject, topic) result limiting
  7. subjects_overlap() (consolidation.rs:1225) — case-insensitive subject comparison for supersession
  8. detect_relationship() (consolidation.rs:897) — regex relationship type extraction
  9. Hebbian co-activation (hebbian.rs) — typed edges (WorksAt, LivesIn, etc.) but not entity-anchored

These mechanisms don't talk to each other. "Sam", "Sam Torres", "the user", and "I" are treated as different subjects across different mechanisms. No centralized entity registry exists.
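To make the mismatch concrete, here is a minimal sketch of a case-insensitive subject check. `same_subject_naive` is a hypothetical stand-in written for illustration; it is not the actual `subjects_overlap()` from consolidation.rs, whose exact logic is not shown in this issue.

```rust
/// Hypothetical stand-in for a case-insensitive subject comparison.
/// Assumption: subjects are compared as whole strings after trimming.
fn same_subject_naive(a: &str, b: &str) -> bool {
    a.trim().eq_ignore_ascii_case(b.trim())
}
```

Under this assumption, "sam" and "Sam" compare equal, but "Sam" vs "Sam Torres" and "the user" vs "Sam" do not, even though all of them may refer to the same person.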

Impact

  • Supersession fails when subject heuristics disagree (e.g., "the user" vs "Sam" in different facts)
  • No entity profiles — can't answer "tell me everything about Sam" without full embedding scan
  • Contradiction detection impossible — can't check for conflicting facts about the same entity without knowing they're the same entity
  • GDPR deletion incomplete — can't find all traces of an entity across subjects, triples, labels, and Hebbian edges
  • 6 roadmap features blocked: contradiction detection, tombstone propagation, faithfulness scoring, collaborative memory, memory-type routing, entity-centric retrieval

Proposed Solution: EntityFrame

A lightweight entity registry (EntityFrame) that absorbs existing mechanisms rather than adding another layer:

  • Replaces: fix_degenerate_subject() (~100 lines), subjects_overlap() (~30 lines), store entity_index
  • Absorbs: the subject field becomes entity_id: Option<EntityId>, entity:* labels are generated from the frame, and triple subjects resolve to an EntityId
  • Untouched: Hebbian graph structure, non-entity labels, scoring pipeline, confidence/quality gates
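A rough sketch of what such a registry could look like follows. Only the names `EntityFrame`, `EntityId`, and the `entity:` label prefix come from this issue; the fields, the `EntityRegistry` type, its methods, and the `entity:sam_torres` label format are assumptions made for illustration. Binding "the user" and "I" to a single entity is also an assumption that only holds for a single-user store.

```rust
use std::collections::HashMap;

/// Opaque identifier for a registered entity (representation is assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct EntityId(pub u32);

/// Hypothetical frame: a canonical name plus every alias that resolves to it.
#[derive(Debug, Clone)]
pub struct EntityFrame {
    pub id: EntityId,
    pub canonical: String,
    pub aliases: Vec<String>,
}

/// Hypothetical registry holding all frames and a lowercased alias index.
#[derive(Debug, Default)]
pub struct EntityRegistry {
    frames: Vec<EntityFrame>,
    alias_index: HashMap<String, EntityId>,
}

fn normalize(s: &str) -> String {
    s.trim().to_lowercase()
}

impl EntityRegistry {
    /// Registers a new entity; the canonical name is indexed as an alias too.
    pub fn register(&mut self, canonical: &str, aliases: &[&str]) -> EntityId {
        let id = EntityId(self.frames.len() as u32);
        let mut all = vec![canonical.to_string()];
        all.extend(aliases.iter().map(|a| a.to_string()));
        for alias in &all {
            self.alias_index.insert(normalize(alias), id);
        }
        self.frames.push(EntityFrame { id, canonical: canonical.to_string(), aliases: all });
        id
    }

    /// Maps a raw subject string to an entity, or None for unknown subjects.
    pub fn resolve(&self, subject: &str) -> Option<EntityId> {
        self.alias_index.get(&normalize(subject)).copied()
    }

    /// One possible replacement for a string-based overlap check:
    /// two subjects match only if both resolve to the same entity.
    pub fn same_entity(&self, a: &str, b: &str) -> bool {
        match (self.resolve(a), self.resolve(b)) {
            (Some(x), Some(y)) => x == y,
            _ => false,
        }
    }

    /// Derives an `entity:` label from the frame (label format is assumed).
    pub fn entity_label(&self, id: EntityId) -> Option<String> {
        self.frames
            .get(id.0 as usize)
            .map(|f| format!("entity:{}", normalize(&f.canonical).replace(' ', "_")))
    }
}
```

With a frame registered as `register("Sam Torres", &["Sam", "the user", "I"])`, all four strings resolve to one `EntityId`, so supersession, labels, and triples could share a single identity check instead of separate heuristics.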

Net complexity: roughly 250 new lines against roughly 300 lines of removed heuristics, so the codebase shrinks by about 50 lines and the system gets simpler.

Store-time detection: Aho-Corasick matching over known aliases, with zero LLM calls. Entity-less memories (pure events, moods) get no entity assignment rather than being force-fitted.
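The store-time alias scan could be sketched as below. This is a small, std-only Aho-Corasick automaton written for illustration; a real implementation would more likely use an existing crate. `AliasMatcher`, its fields, the `u32` entity ids, ASCII case folding, and the word-boundary filter are all assumptions rather than part of the design brief.

```rust
use std::collections::{HashMap, VecDeque};

/// Minimal Aho-Corasick automaton over bytes (hypothetical type).
pub struct AliasMatcher {
    edges: Vec<HashMap<u8, usize>>, // trie transitions per state
    fail: Vec<usize>,               // failure link per state
    out: Vec<Vec<(usize, u32)>>,    // (alias length, entity id) ending at state
}

impl AliasMatcher {
    pub fn new(aliases: &[(&str, u32)]) -> Self {
        let mut m = AliasMatcher {
            edges: vec![HashMap::new()],
            fail: vec![0],
            out: vec![Vec::new()],
        };
        // Phase 1: insert every alias into the trie (ASCII case-folded).
        for (alias, id) in aliases {
            let pat = alias.to_ascii_lowercase();
            if pat.is_empty() {
                continue;
            }
            let mut s = 0;
            for b in pat.bytes() {
                let existing = m.edges[s].get(&b).copied();
                s = match existing {
                    Some(next) => next,
                    None => {
                        m.edges.push(HashMap::new());
                        m.fail.push(0);
                        m.out.push(Vec::new());
                        let next = m.edges.len() - 1;
                        m.edges[s].insert(b, next);
                        next
                    }
                };
            }
            m.out[s].push((pat.len(), *id));
        }
        // Phase 2: breadth-first pass to compute failure links and merged outputs.
        let mut queue: VecDeque<usize> = m.edges[0].values().copied().collect();
        while let Some(s) = queue.pop_front() {
            let children: Vec<(u8, usize)> = m.edges[s].iter().map(|(&b, &t)| (b, t)).collect();
            for (b, t) in children {
                let mut f = m.fail[s];
                while f != 0 && !m.edges[f].contains_key(&b) {
                    f = m.fail[f];
                }
                // Depth-1 states keep their initial failure link to the root.
                let target = if s == 0 { 0 } else { m.edges[f].get(&b).copied().unwrap_or(0) };
                m.fail[t] = target;
                let inherited = m.out[target].clone();
                m.out[t].extend(inherited);
                queue.push_back(t);
            }
        }
        m
    }

    /// Returns (start, end, entity id) byte spans of whole-word alias matches.
    pub fn find_all(&self, text: &str) -> Vec<(usize, usize, u32)> {
        // ASCII lowercasing preserves byte length, so offsets index into `text`.
        let hay = text.to_ascii_lowercase();
        let bytes = hay.as_bytes();
        let mut hits = Vec::new();
        let mut s = 0;
        for (i, &b) in bytes.iter().enumerate() {
            while s != 0 && !self.edges[s].contains_key(&b) {
                s = self.fail[s];
            }
            s = self.edges[s].get(&b).copied().unwrap_or(0);
            for &(len, id) in &self.out[s] {
                let start = i + 1 - len;
                let left_ok = start == 0 || !bytes[start - 1].is_ascii_alphanumeric();
                let right_ok = i + 1 == bytes.len() || !bytes[i + 1].is_ascii_alphanumeric();
                if left_ok && right_ok {
                    hits.push((start, i + 1, id));
                }
            }
        }
        hits
    }
}
```

A memory whose text yields no hits would simply keep `entity_id` as `None`, which matches the intent of not force-fitting entity-less memories. The word-boundary filter keeps an alias like "Sam" from matching inside "Samantha".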

Design Brief

Full design brief with EntityFrame structure, open questions, and roadmap impact at:
Obsidian Vault/ShrimPK Kernel/Design Brief — Entity Identification.md

Related Research

  • Zep/Graphiti: NER + LLM entity extraction + temporal validity per entity edge
  • Mem0: User profile building via fact extraction
  • MIRIX (arXiv 2507.07957): "core biography" memory type per entity
  • A-MEM (NeurIPS 2025): Zettelkasten notes with retroactive link updates on entity changes

Labels

  • enhancement
  • architecture
  • consolidation
