Rethinking Memory Architectures for Large Language Models
EvoGraph is a minimalist graph-based conversational memory system that challenges the prevailing complexity assumption in LLM memory design. By using a simple entity-centric architecture with raw context preservation, EvoGraph outperforms 13 state-of-the-art systems on the LoCoMo benchmark — including SYNAPSE, Zep, MemoryOS, and A-Mem — while using significantly less code.
Evaluated on LoCoMo-10 (1,540 queries across 10 long-term conversations; GPT-4o-mini, temperature=0):
| Method | Category | F1 (%) | BLEU-1 (%) |
|---|---|---|---|
| EvoGraph (Ours) | Graph-based | 44.82 | 35.88 |
| SYNAPSE (Jiang+26) | Graph-based | 40.50 | 32.60 |
| Zep (Rasmussen+25) | Graph-based | 39.70 | 31.20 |
| MemoryOS (Li+25) | System-level | 38.00 | 29.10 |
| LangMem | System-level | 34.30 | 25.70 |
| AriGraph (Anokhin+25) | Graph-based | 33.70 | 26.20 |
| A-Mem (Xu+25) | Graph-based | 33.30 | 26.20 |
| MemGPT (Packer+24) | System-level | 28.00 | 20.50 |
| LoCoMo (Maharana+24) | Agentic | 25.60 | 19.90 |
Baseline results sourced from [Jiang et al., 2026] for fair comparison on the same data split.
| Query Type | EvoGraph | SYNAPSE | Delta |
|---|---|---|---|
| Multi-Hop (54.9%) | 51.39% | 35.70% | +15.69 pp (+44.0%) |
| Temporal (20.9%) | 46.21% | 50.10% | -3.89 pp |
| Single-Hop (18.4%) | 31.55% | 48.90% | -17.35 pp |
| Open-Domain (5.8%) | 14.55% | 25.90% | -11.35 pp |
EvoGraph dominates multi-hop reasoning — the hardest and most prevalent query type (54.9% of benchmark) — by a +44.0% relative margin.
Most existing conversational memory systems (SYNAPSE, Zep, MemoryOS, etc.) adopt increasingly complex designs: multi-layer graph structures, spreading activation, diffusion mechanisms, OS-level memory management. The implicit assumption is that more structure = better performance.
We challenge this assumption. Our key insight:
In the era of powerful LLMs, the memory system's job is retrieval, not reasoning. Give the LLM the right context in its native format, and let it reason.
Complex systems: Multi-layer graphs + Spreading activation + Structured output → LLM
EvoGraph: Entity index + Co-occurrence retrieval + Raw dialogue text → LLM
| Principle | Validated By | Result |
|---|---|---|
| Quality > Quantity | 10 entities vs 61 entities | +27.82 pp F1 |
| Raw > Structured | Raw dialogue vs Structured timeline | +0.99 pp F1 (single-hop: +3.64 pp, p=0.013) |
| Entity-Centric Sufficiency | Entity-centric vs Relation traversal | Comparable (no significant difference) |
```
┌─────────────────────────────────────────────────────────┐
│                    EvoGraph Pipeline                     │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  ┌──────────────┐    ┌───────────────┐    ┌──────────┐  │
│  │ Conversation │───▶│ Entity        │───▶│ Neo4j    │  │
│  │ History      │    │ Extraction    │    │ Graph    │  │
│  │              │    │ (10 entities  │    │          │  │
│  │ [Turn 1]     │    │  per note)    │    │ Entity   │  │
│  │ [Turn 2]     │    │               │    │   ↕      │  │
│  │ ...          │    │ LLM-based     │    │ Note     │  │
│  │ [Turn N]     │    │ CRUD ops      │    │          │  │
│  └──────────────┘    └───────────────┘    └────┬─────┘  │
│                                                │        │
│  ┌──────────────┐    ┌───────────────┐    ┌────▼─────┐  │
│  │              │◀───│ Embedding     │◀───│ Entity   │  │
│  │ Answer       │    │ Reranking     │    │ Matching │  │
│  │ Generation   │    │               │    │          │  │
│  │              │    │ Cosine sim.   │    │ Query →  │  │
│  │ LLM + raw    │    │ Top-k notes   │    │ Entities │  │
│  │ dialogue     │    │               │    │          │  │
│  └──────────────┘    └───────────────┘    └──────────┘  │
│                                                          │
└─────────────────────────────────────────────────────────┘
```
Graph Structure (intentionally minimal):

```
(Entity) ──[:MENTIONED_IN {turn_id}]──▶ (Note)
(Note)   ──[:NEXT]──▶ (Note)
```
No multi-layer hierarchies. No spreading activation. No state evolution chains. Just entities linked to the conversations that mention them.
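The retrieval this structure supports can be sketched in plain Python. The snippet below is an illustrative in-memory stand-in for the Neo4j graph, not the repository's actual API: notes are retrieved purely by how many query entities co-occur in them.

```python
from collections import defaultdict

# Toy in-memory stand-in for the (Entity)-[:MENTIONED_IN]->(Note) graph.
entity_to_notes = defaultdict(set)

def add_note(note_id, entities):
    """Link each extracted entity to the note that mentions it."""
    for entity in entities:
        entity_to_notes[entity.lower()].add(note_id)

def retrieve(query_entities):
    """Score notes by how many query entities co-occur in them."""
    scores = defaultdict(int)
    for entity in query_entities:
        for note_id in entity_to_notes.get(entity.lower(), ()):
            scores[note_id] += 1
    # Notes mentioning more query entities rank higher.
    return sorted(scores, key=scores.get, reverse=True)

add_note("note_1", ["Jon", "banker", "job"])
add_note("note_2", ["Jon", "New Haven"])
add_note("note_3", ["Maria", "painting"])

print(retrieve(["Jon", "job"]))  # note_1 (2 hits) ranks before note_2 (1 hit)
```

Co-occurrence counting is the entire retrieval signal here; there is no graph traversal beyond the one-hop entity-to-note edge.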
Four systematic ablation experiments validate each design choice:
| Method | F1 (%) | Delta |
|---|---|---|
| Embedding Rerank (Ours) | 44.82 | baseline |
| LLM Rerank (GPT-4o-mini) | 47.77 | +2.95 (upper bound) |
| BM25 Rerank | 43.10 | -1.72 |
| No Rerank | 22.94 | -21.88 |
Embedding-based reranking balances performance and reproducibility. LLM reranking provides further gains but introduces fairness concerns for comparison.
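The embedding rerank step reduces to sorting candidate notes by cosine similarity against the query embedding. A minimal sketch with toy 3-d vectors (a real run would use Sentence-Transformers embeddings; the function names are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank(query_vec, notes, top_k=2):
    """Sort candidate notes by cosine similarity to the query embedding."""
    scored = [(cosine(query_vec, vec), note_id) for note_id, vec in notes]
    scored.sort(reverse=True)
    return [note_id for _, note_id in scored[:top_k]]

# Toy embeddings; in practice these come from the embedding model.
notes = [("note_1", [0.9, 0.1, 0.0]),
         ("note_2", [0.1, 0.9, 0.0]),
         ("note_3", [0.7, 0.3, 0.1])]
print(rerank([1.0, 0.0, 0.0], notes))  # ['note_1', 'note_3']
```

The -21.88 pp drop without reranking shows that entity matching alone over-retrieves; the similarity sort is what prunes the candidate set down to relevant notes.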
| Strategy | # Entities | F1 (%) | Delta |
|---|---|---|---|
| Quality-focused (Ours) | 10 | 44.82 | baseline |
| Quantity-focused | 61 | 17.00 | -27.82 (p<0.001) |
Over-extraction introduces catastrophic noise.
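The quality-over-quantity finding suggests hard-capping and deduplicating extraction output. A hedged sketch of such a filter (the cap and length threshold here are illustrative, not the repository's exact logic):

```python
def cap_entities(candidates, max_entities=10, min_len=2):
    """Keep at most `max_entities` deduplicated, non-trivial entities.

    Over-extraction (61 entities/note in the ablation) floods retrieval
    with noise; a small, clean entity set is what the main result relies on.
    """
    seen, kept = set(), []
    for ent in candidates:
        key = ent.strip().lower()
        if len(key) < min_len or key in seen:
            continue  # drop trivial fragments and case-insensitive duplicates
        seen.add(key)
        kept.append(ent.strip())
        if len(kept) == max_entities:
            break
    return kept

print(cap_entities(["Jon", "jon", "banker", "a", "job", "Jon "]))
# ['Jon', 'banker', 'job']
```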
| Format | F1 (%) | Delta |
|---|---|---|
| Raw dialogue (Ours) | 47.77 | baseline |
| Structured timeline (LineNode) | 46.78 | -0.99 |
Raw context preserves the nuance that LLMs need for reasoning. Single-hop queries show a significant improvement with raw format (+3.64 pp, p=0.013).
| Strategy | Recall (%) | Delta |
|---|---|---|
| Entity-centric (Ours) | 70.30 | baseline |
| + Relation traversal | 66.72 | -3.58 |
Entity co-occurrence provides sufficient retrieval signal without explicit relation edges.
- Python 3.9+
- Neo4j 4.x+ (Docker recommended)
- An OpenAI-compatible LLM API endpoint
```bash
git clone https://github.com/YOUR_USERNAME/EvoGraph.git
cd EvoGraph
pip install -r requirements.txt
```

Copy the example config and fill in your credentials:

```bash
cp config.example.py config.py
# Edit config.py with your LLM API key and Neo4j credentials
```

Import the benchmark data, then run the evaluation:

```bash
python import_data.py
python evaluate.py
```

Minimal usage example:

```python
from evograph_nolinenode import EvoGraphNoLineNode

evo = EvoGraphNoLineNode()
result = evo.answer("When did Jon lose his job as a banker?")
print(result["answer"])
# >>> "February, 2023"
evo.close()
```

Project layout:

```
├── evograph_nolinenode.py        # Main interface: retrieve + answer
├── graph_retriever.py            # Two-layer retrieval + embedding rerank
├── knowledge_graph_nolinenode.py # Neo4j graph operations (Entity ↔ Note)
├── entity_extractor.py           # LLM-based entity extraction (CRUD)
├── embedding_manager.py          # Sentence-Transformers wrapper
├── kg_builder_nolinenode.py      # Graph construction pipeline
├── llm_client.py                 # LLM API client with load balancing
├── config.py                     # Configuration (LLM, Neo4j, thresholds)
├── import_data.py                # Data import entry point
├── evaluate.py                   # Evaluation harness (F1, BLEU, Recall)
└── eval_results/                 # Benchmark results
```
Structured representations are lossy compression. Converting "I moved to New Haven last week because my new job is there" into {entity: "James", location: "New Haven", action: "moved"} loses the causal reasoning chain. LLMs need complete context.
LLMs are trained on natural text. Feeding them natural dialogue activates their strongest reasoning capabilities. Structured formats (JSON, timelines, graph tuples) are out-of-distribution.
Explicit structure (e.g., Timeline: James → Hartford (LIVES_IN) → New Haven (LIVES_IN, UPDATE)) forces the LLM to first parse the format, then reason. Raw text eliminates this overhead.
We evaluate on LoCoMo (Long-term Conversational Memory), which tests whether systems can answer questions requiring information scattered across hundreds of conversation turns. LoCoMo-10 contains 10 conversations with 1,540 question-answer pairs across four categories:
- Multi-hop (54.9%): Requires reasoning across multiple conversation turns
- Temporal (20.9%): Requires temporal reasoning about events
- Single-hop (18.4%): Simple fact lookup from a single turn
- Open-domain (5.8%): General knowledge grounded in conversation
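Answers are scored with token-level F1. A standard sketch of the metric (this mirrors the common SQuAD-style token-overlap F1 and may differ in tokenization details from evaluate.py):

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and the gold answer."""
    pred = prediction.lower().split()
    gold = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("lost his job in february", "february 2023"), 2))
# 0.29 — one shared token, so precision 1/5 and recall 1/2
```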
| System | Graph Structure | Retrieval | Lines of Code | F1 (%) |
|---|---|---|---|---|
| EvoGraph | Single-layer (Entity-Note) | Entity matching | ~1,500 | 44.82 |
| SYNAPSE | Three-layer (perception/semantic/event) | Spreading activation + Lateral inhibition | ~5,000 | 40.50 |
| Zep | Temporal knowledge graph | Temporal traversal | ~3,000 | 39.70 |
```bibtex
@misc{evograph2026,
  title={EvoGraph: Less is More for LLM Long-Term Conversational Memory},
  year={2026},
  note={Under review}
}
```

MIT License