Rethinking Memory Architectures for Large Language Models
EvoGraph is a minimalist graph-based conversational memory system that challenges the prevailing complexity assumption in LLM memory design. By using a simple entity-centric architecture with raw context preservation, EvoGraph outperforms 13 state-of-the-art systems on the LoCoMo benchmark — including SYNAPSE, Zep, MemoryOS, and A-Mem — while using significantly less code.
Evaluated on LoCoMo-10 (1,540 queries across 10 long-term conversations; GPT-4o-mini, temperature=0):
| Method | Category | F1 (%) | BLEU-1 (%) |
|---|---|---|---|
| EvoGraph (Ours) | Graph-based | 44.82 | 35.88 |
| SYNAPSE (Jiang+26) | Graph-based | 40.50 | 32.60 |
| Zep (Rasmussen+25) | Graph-based | 39.70 | 31.20 |
| MemoryOS (Li+25) | System-level | 38.00 | 29.10 |
| LangMem | System-level | 34.30 | 25.70 |
| AriGraph (Anokhin+25) | Graph-based | 33.70 | 26.20 |
| A-Mem (Xu+25) | Graph-based | 33.30 | 26.20 |
| MemGPT (Packer+24) | System-level | 28.00 | 20.50 |
| LoCoMo (Maharana+24) | Agentic | 25.60 | 19.90 |
Baseline results sourced from [Jiang et al., 2026] for fair comparison on the same data split.
| Query Type | EvoGraph | SYNAPSE | Delta |
|---|---|---|---|
| Multi-Hop (54.9%) | 51.39% | 35.70% | +15.69 pp (+44.0%) |
| Temporal (20.9%) | 46.21% | 50.10% | -3.89 pp |
| Single-Hop (18.4%) | 31.55% | 48.90% | -17.35 pp |
| Open-Domain (5.8%) | 14.55% | 25.90% | -11.35 pp |
EvoGraph dominates multi-hop reasoning — the hardest and most prevalent query type (54.9% of benchmark) — by a +44.0% relative margin.
Most existing conversational memory systems (SYNAPSE, Zep, MemoryOS, etc.) adopt increasingly complex designs: multi-layer graph structures, spreading activation, diffusion mechanisms, OS-level memory management. The implicit assumption is that more structure = better performance.
We challenge this assumption. Our key insight:
In the era of powerful LLMs, the memory system's job is retrieval, not reasoning. Give the LLM the right context in its native format, and let it reason.
Complex systems: Multi-layer graphs + Spreading activation + Structured output → LLM
EvoGraph: Entity index + Co-occurrence retrieval + Raw dialogue text → LLM
| Principle | Validated By | Result |
|---|---|---|
| Quality > Quantity | 10 entities vs 61 entities | +27.82 pp F1 |
| Raw > Structured | Raw dialogue vs Structured timeline | +0.99 pp F1 (single-hop: +3.64 pp, p=0.013) |
| Entity-Centric Sufficiency | Entity-centric vs Relation traversal | Comparable (no significant difference) |
```
┌─────────────────────────────────────────────────────────┐
│                    EvoGraph Pipeline                     │
├─────────────────────────────────────────────────────────┤
│                                                          │
│  ┌──────────────┐    ┌───────────────┐    ┌──────────┐  │
│  │ Conversation │───▶│ Entity        │───▶│ Neo4j    │  │
│  │ History      │    │ Extraction    │    │ Graph    │  │
│  │              │    │ (10 entities  │    │          │  │
│  │ [Turn 1]     │    │  per note)    │    │ Entity   │  │
│  │ [Turn 2]     │    │               │    │   ↕      │  │
│  │ ...          │    │ LLM-based     │    │ Note     │  │
│  │ [Turn N]     │    │ CRUD ops      │    │          │  │
│  └──────────────┘    └───────────────┘    └────┬─────┘  │
│                                                │        │
│  ┌──────────────┐    ┌───────────────┐    ┌────▼─────┐  │
│  │              │◀───│ Embedding     │◀───│ Entity   │  │
│  │ Answer       │    │ Reranking     │    │ Matching │  │
│  │ Generation   │    │               │    │          │  │
│  │              │    │ Cosine sim.   │    │ Query →  │  │
│  │ LLM + raw    │    │ Top-k notes   │    │ Entities │  │
│  │ dialogue     │    │               │    │          │  │
│  └──────────────┘    └───────────────┘    └──────────┘  │
│                                                          │
└─────────────────────────────────────────────────────────┘
```
Graph Structure (intentionally minimal):

```
(Entity) ──[:MENTIONED_IN {turn_id}]──▶ (Note)
(Note)   ──[:NEXT]──▶ (Note)
```
No multi-layer hierarchies. No spreading activation. No state evolution chains. Just entities linked to the conversations that mention them.
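The retrieval this structure supports can be sketched in plain Python. The snippet below is an illustrative in-memory stand-in for the Neo4j graph, not the repository's actual API: notes are retrieved purely by how many query entities co-occur in them.

```python
from collections import defaultdict

# Toy in-memory stand-in for the (Entity)-[:MENTIONED_IN]->(Note) graph.
entity_to_notes = defaultdict(set)

def add_note(note_id, entities):
    """Link each extracted entity to the note that mentions it."""
    for entity in entities:
        entity_to_notes[entity.lower()].add(note_id)

def retrieve(query_entities):
    """Score notes by how many query entities co-occur in them."""
    scores = defaultdict(int)
    for entity in query_entities:
        for note_id in entity_to_notes.get(entity.lower(), ()):
            scores[note_id] += 1
    # Notes mentioning more query entities rank higher.
    return sorted(scores, key=scores.get, reverse=True)

add_note("note_1", ["Jon", "banker", "job"])
add_note("note_2", ["Jon", "New Haven"])
add_note("note_3", ["Maria", "painting"])

print(retrieve(["Jon", "job"]))  # note_1 (2 hits) ranks before note_2 (1 hit)
```

Co-occurrence counting is the entire retrieval signal here; there is no graph traversal beyond the one-hop entity-to-note edge.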
Four systematic ablation experiments validate each design choice:
| Method | F1 (%) | Delta |
|---|---|---|
| Embedding Rerank (Ours) | 44.82 | baseline |
| LLM Rerank (GPT-4o-mini) | 47.77 | +2.95 (upper bound) |
| BM25 Rerank | 43.10 | -1.72 |
| No Rerank | 22.94 | -21.88 |
Embedding-based reranking balances performance and reproducibility. LLM reranking provides further gains but introduces fairness concerns for comparison.
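The embedding rerank step reduces to sorting candidate notes by cosine similarity against the query embedding. A minimal sketch with toy 3-d vectors (a real run would use Sentence-Transformers embeddings; the function names are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank(query_vec, notes, top_k=2):
    """Sort candidate notes by cosine similarity to the query embedding."""
    scored = [(cosine(query_vec, vec), note_id) for note_id, vec in notes]
    scored.sort(reverse=True)
    return [note_id for _, note_id in scored[:top_k]]

# Toy embeddings; in practice these come from the embedding model.
notes = [("note_1", [0.9, 0.1, 0.0]),
         ("note_2", [0.1, 0.9, 0.0]),
         ("note_3", [0.7, 0.3, 0.1])]
print(rerank([1.0, 0.0, 0.0], notes))  # ['note_1', 'note_3']
```

The -21.88 pp drop without reranking shows that entity matching alone over-retrieves; the similarity sort is what prunes the candidate set down to relevant notes.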
| Strategy | # Entities | F1 (%) | Delta |
|---|---|---|---|
| Quality-focused (Ours) | 10 | 44.82 | baseline |
| Quantity-focused | 61 | 17.00 | -27.82 (p<0.001) |
Over-extraction introduces catastrophic noise.
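The quality-over-quantity finding suggests hard-capping and deduplicating extraction output. A hedged sketch of such a filter (the cap and length threshold here are illustrative, not the repository's exact logic):

```python
def cap_entities(candidates, max_entities=10, min_len=2):
    """Keep at most `max_entities` deduplicated, non-trivial entities.

    Over-extraction (61 entities/note in the ablation) floods retrieval
    with noise; a small, clean entity set is what the main result relies on.
    """
    seen, kept = set(), []
    for ent in candidates:
        key = ent.strip().lower()
        if len(key) < min_len or key in seen:
            continue  # drop trivial fragments and case-insensitive duplicates
        seen.add(key)
        kept.append(ent.strip())
        if len(kept) == max_entities:
            break
    return kept

print(cap_entities(["Jon", "jon", "banker", "a", "job", "Jon "]))
# ['Jon', 'banker', 'job']
```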
| Format | F1 (%) | Delta |
|---|---|---|
| Raw dialogue (Ours) | 47.77 | baseline |
| Structured timeline (LineNode) | 46.78 | -0.99 |
Raw context preserves the nuance that LLMs need for reasoning. Single-hop queries show a significant improvement with raw format (+3.64 pp, p=0.013).
| Strategy | Recall (%) | Delta |
|---|---|---|
| Entity-centric (Ours) | 70.30 | baseline |
| + Relation traversal | 66.72 | -3.58 |
Entity co-occurrence provides sufficient retrieval signal without explicit relation edges.
- Python 3.9+
- Neo4j 4.x+ (Docker recommended)
- An OpenAI-compatible LLM API endpoint
```bash
git clone https://github.com/YOUR_USERNAME/EvoGraph.git
cd EvoGraph
pip install -r requirements.txt
```

Copy the example config and fill in your credentials:

```bash
cp config.example.py config.py
# Edit config.py with your LLM API key and Neo4j credentials
```

Import the benchmark data, then run the evaluation:

```bash
python import_data.py
python evaluate.py
```

Minimal usage example:

```python
from evograph_nolinenode import EvoGraphNoLineNode

evo = EvoGraphNoLineNode()
result = evo.answer("When did Jon lose his job as a banker?")
print(result["answer"])
# >>> "February, 2023"
evo.close()
```

Project layout:

```
├── evograph_nolinenode.py        # Main interface: retrieve + answer
├── graph_retriever.py            # Two-layer retrieval + embedding rerank
├── knowledge_graph_nolinenode.py # Neo4j graph operations (Entity ↔ Note)
├── entity_extractor.py           # LLM-based entity extraction (CRUD)
├── embedding_manager.py          # Sentence-Transformers wrapper
├── kg_builder_nolinenode.py      # Graph construction pipeline
├── llm_client.py                 # LLM API client with load balancing
├── config.py                     # Configuration (LLM, Neo4j, thresholds)
├── import_data.py                # Data import entry point
├── evaluate.py                   # Evaluation harness (F1, BLEU, Recall)
└── eval_results/                 # Benchmark results
```
Structured representations are lossy compression. Converting "I moved to New Haven last week because my new job is there" into {entity: "James", location: "New Haven", action: "moved"} loses the causal reasoning chain. LLMs need complete context.
LLMs are trained on natural text. Feeding them natural dialogue activates their strongest reasoning capabilities. Structured formats (JSON, timelines, graph tuples) are out-of-distribution.
Explicit structure (e.g., Timeline: James → Hartford (LIVES_IN) → New Haven (LIVES_IN, UPDATE)) forces the LLM to first parse the format, then reason. Raw text eliminates this overhead.
We evaluate on LoCoMo (Long-term Conversational Memory), which tests whether systems can answer questions requiring information scattered across hundreds of conversation turns. LoCoMo-10 contains 10 conversations with 1,540 question-answer pairs across four categories:
- Multi-hop (54.9%): Requires reasoning across multiple conversation turns
- Temporal (20.9%): Requires temporal reasoning about events
- Single-hop (18.4%): Simple fact lookup from a single turn
- Open-domain (5.8%): General knowledge grounded in conversation
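Answers are scored with token-level F1. A standard sketch of the metric (this mirrors the common SQuAD-style token-overlap F1 and may differ in tokenization details from evaluate.py):

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and the gold answer."""
    pred = prediction.lower().split()
    gold = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("lost his job in february", "february 2023"), 2))
# 0.29 — one shared token, so precision 1/5 and recall 1/2
```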
| System | Graph Structure | Retrieval | Lines of Code | F1 (%) |
|---|---|---|---|---|
| EvoGraph | Single-layer (Entity-Note) | Entity matching | ~1,500 | 44.82 |
| SYNAPSE | Three-layer (perception/semantic/event) | Spreading activation + Lateral inhibition | ~5,000 | 40.50 |
| Zep | Temporal knowledge graph | Temporal traversal | ~3,000 | 39.70 |
```bibtex
@misc{evograph2026,
  title={EvoGraph: Less is More for LLM Long-Term Conversational Memory},
  year={2026},
  note={Under review}
}
```

MIT License