AnnaSuSu/EvoGraph
EvoGraph: Less is More for LLM Long-Term Conversational Memory

Rethinking Memory Architectures for Large Language Models

EvoGraph is a minimalist graph-based conversational memory system that challenges the prevailing complexity assumption in LLM memory design. By using a simple entity-centric architecture with raw context preservation, EvoGraph outperforms 13 state-of-the-art systems on the LoCoMo benchmark — including SYNAPSE, Zep, MemoryOS, and A-Mem — while using significantly less code.


Key Results

Evaluated on LoCoMo-10 (1,540 queries across 10 long-term conversations, GPT-4o-mini, temp=0):

| Method | Category | F1 (%) | BLEU-1 (%) |
|---|---|---|---|
| EvoGraph (Ours) | Graph-based | 44.82 | 35.88 |
| SYNAPSE (Jiang+26) | Graph-based | 40.50 | 32.60 |
| Zep (Rasmussen+25) | Graph-based | 39.70 | 31.20 |
| MemoryOS (Li+25) | System-level | 38.00 | 29.10 |
| LangMem | System-level | 34.30 | 25.70 |
| AriGraph (Anokhin+25) | Graph-based | 33.70 | 26.20 |
| A-Mem (Xu+25) | Graph-based | 33.30 | 26.20 |
| MemGPT (Packer+24) | System-level | 28.00 | 20.50 |
| LoCoMo (Maharana+24) | Agentic | 25.60 | 19.90 |

Baseline results sourced from [Jiang et al., 2026] for fair comparison on the same data split.

Performance by Query Type

| Query Type | EvoGraph | SYNAPSE | Delta |
|---|---|---|---|
| Multi-Hop (54.9%) | 51.39% | 35.70% | +15.69 pp (+44.0%) |
| Temporal (20.9%) | 46.21% | 50.10% | -3.89 pp |
| Single-Hop (18.4%) | 31.55% | 48.90% | -17.35 pp |
| Open-Domain (5.8%) | 14.55% | 25.90% | -11.35 pp |

EvoGraph dominates multi-hop reasoning — the hardest and most prevalent query type (54.9% of benchmark) — by a +44.0% relative margin.


Core Idea

Most existing conversational memory systems (SYNAPSE, Zep, MemoryOS, etc.) adopt increasingly complex designs: multi-layer graph structures, spreading activation, diffusion mechanisms, OS-level memory management. The implicit assumption is that more structure = better performance.

We challenge this assumption. Our key insight:

In the era of powerful LLMs, the memory system's job is retrieval, not reasoning. Give the LLM the right context in its native format, and let it reason.

Complex systems:  Multi-layer graphs + Spreading activation + Structured output → LLM
EvoGraph:         Entity index + Co-occurrence retrieval + Raw dialogue text    → LLM

Three Design Principles (validated by ablation)

| Principle | Validated By | Result |
|---|---|---|
| Quality > Quantity | 10 entities vs 61 entities | +27.82 pp F1 |
| Raw > Structured | Raw dialogue vs structured timeline | +0.99 pp F1 (single-hop: +3.64 pp, p=0.013) |
| Entity-Centric Sufficiency | Entity-centric vs relation traversal | Comparable (no significant difference) |

Architecture

┌─────────────────────────────────────────────────────────┐
│                    EvoGraph Pipeline                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────────────┐    ┌───────────────┐    ┌──────────┐ │
│  │ Conversation  │───▶│    Entity     │───▶│  Neo4j   │ │
│  │   History     │    │  Extraction   │    │  Graph   │ │
│  │              │    │  (10 entities  │    │          │ │
│  │  [Turn 1]    │    │   per note)   │    │ Entity   │ │
│  │  [Turn 2]    │    │              │    │   ↕      │ │
│  │    ...       │    │  LLM-based   │    │  Note    │ │
│  │  [Turn N]    │    │  CRUD ops    │    │          │ │
│  └──────────────┘    └───────────────┘    └────┬─────┘ │
│                                                │       │
│  ┌──────────────┐    ┌───────────────┐    ┌────▼─────┐ │
│  │              │◀───│   Embedding   │◀───│ Entity   │ │
│  │   Answer     │    │   Reranking   │    │ Matching │ │
│  │  Generation  │    │              │    │          │ │
│  │              │    │ Cosine sim.  │    │ Query →  │ │
│  │  LLM + raw   │    │ Top-k notes  │    │ Entities │ │
│  │  dialogue    │    │              │    │          │ │
│  └──────────────┘    └───────────────┘    └──────────┘ │
│                                                         │
└─────────────────────────────────────────────────────────┘

Graph Structure (intentionally minimal):

(Entity) ──[:MENTIONED_IN {turn_id}]──▶ (Note)
(Note)   ──[:NEXT]──▶                   (Note)

No multi-layer hierarchies. No spreading activation. No state evolution chains. Just entities linked to the conversations that mention them.
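As a rough sketch of this retrieval idea: the real system stores these links in Neo4j, but the core logic is just an inverted index from entities to the notes that mention them, with candidates ranked by how many query entities co-occur in each note. The `EntityIndex` class and its method names below are illustrative, not the repository's API.

```python
from collections import defaultdict

class EntityIndex:
    """In-memory stand-in for the (Entity)-[:MENTIONED_IN]->(Note) graph."""

    def __init__(self):
        self.notes = {}                       # note_id -> raw dialogue text
        self.mentioned_in = defaultdict(set)  # entity -> {note_id, ...}

    def add_note(self, note_id, text, entities):
        self.notes[note_id] = text
        for entity in entities:
            self.mentioned_in[entity.lower()].add(note_id)

    def retrieve(self, query_entities):
        """Rank notes by how many of the query's entities they mention."""
        scores = defaultdict(int)
        for entity in query_entities:
            for note_id in self.mentioned_in.get(entity.lower(), ()):
                scores[note_id] += 1
        ranked = sorted(scores, key=scores.get, reverse=True)
        return [self.notes[n] for n in ranked]

index = EntityIndex()
index.add_note(1, "Jon: I lost my banking job in February 2023.", ["Jon", "banking"])
index.add_note(2, "Maria: I adopted a cat last month.", ["Maria", "cat"])
print(index.retrieve(["Jon", "banking"])[0])
# -> Jon: I lost my banking job in February 2023.
```

The retrieved notes are raw dialogue text, in keeping with the Raw > Structured principle; reranking and answer generation happen downstream.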


Ablation Studies

Four systematic ablation experiments validate each design choice:

1. Reranking Strategy

| Method | F1 (%) | Delta |
|---|---|---|
| Embedding Rerank (Ours) | 44.82 | baseline |
| LLM Rerank (GPT-4o-mini) | 47.77 | +2.95 (upper bound) |
| BM25 Rerank | 43.10 | -1.72 |
| No Rerank | 22.94 | -21.88 |

Embedding-based reranking balances performance and reproducibility. LLM reranking provides further gains but introduces fairness concerns for comparison.
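The embedding rerank step reduces to cosine similarity between the query embedding and each candidate note's embedding. A minimal self-contained sketch follows; the actual system computes embeddings with Sentence-Transformers (see embedding_manager.py), and the function names here are illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank(query_vec, note_vecs, notes, top_k=3):
    """Keep the top-k candidate notes by similarity to the query."""
    scored = sorted(zip(notes, note_vecs),
                    key=lambda nv: cosine(query_vec, nv[1]),
                    reverse=True)
    return [note for note, _ in scored[:top_k]]

# Toy 2-d embeddings: notes "a" and "c" point roughly along the query.
print(rerank([1.0, 0.0],
             [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]],
             ["a", "b", "c"], top_k=2))
# -> ['a', 'c']
```

Entity matching casts a wide net; this similarity cut-off is what keeps the context passed to the LLM small, which the ablation shows is critical (-21.88 pp without any rerank).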

2. Entity Extraction Strategy

| Strategy | # Entities | F1 (%) | Delta |
|---|---|---|---|
| Quality-focused (Ours) | 10 | 44.82 | baseline |
| Quantity-focused | 61 | 17.00 | -27.82 (p<0.001) |

Over-extraction introduces catastrophic noise.

3. Information Presentation

| Format | F1 (%) | Delta |
|---|---|---|
| Raw dialogue (Ours) | 47.77 | baseline |
| Structured timeline (LineNode) | 46.78 | -0.99 |

Raw context preserves the nuance that LLMs need for reasoning. Single-hop queries show a significant improvement with raw format (+3.64 pp, p=0.013).

4. Retrieval Strategy

| Strategy | Recall (%) | Delta |
|---|---|---|
| Entity-centric (Ours) | 70.30 | baseline |
| + Relation traversal | 66.72 | -3.58 |

Entity co-occurrence provides sufficient retrieval signal without explicit relation edges.


Quick Start

Prerequisites

  • Python 3.9+
  • Neo4j 4.x+ (Docker recommended)
  • An OpenAI-compatible LLM API endpoint

Installation

git clone https://github.com/YOUR_USERNAME/EvoGraph.git
cd EvoGraph
pip install -r requirements.txt

Configuration

Copy the example config and fill in your credentials:

cp config.example.py config.py
# Edit config.py with your LLM API key and Neo4j credentials
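The field names below are hypothetical placeholders showing the shape such a file typically takes; the authoritative names are in config.example.py in this repository.

```python
# config.py (illustrative sketch; field names are NOT the repository's actual ones)

# OpenAI-compatible LLM endpoint
LLM_API_BASE = "https://api.example.com/v1"
LLM_API_KEY = "sk-..."
LLM_MODEL = "gpt-4o-mini"   # model used in the paper's evaluation, temp=0

# Neo4j connection
NEO4J_URI = "bolt://localhost:7687"
NEO4J_USER = "neo4j"
NEO4J_PASSWORD = "password"
```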

Build the Knowledge Graph

python import_data.py

Run Evaluation

python evaluate.py
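The F1 reported by the harness is token-level F1 between predicted and gold answers, as is standard for LoCoMo. A sketch of the usual SQuAD-style computation, assuming simple lowercasing and whitespace tokenization (the repository's exact normalization may differ):

```python
from collections import Counter

def token_f1(prediction, reference):
    """SQuAD-style token-level F1 between two answer strings."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    common = Counter(pred) & Counter(ref)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(token_f1("February, 2023", "february, 2023"))  # -> 1.0
```

The benchmark score is this value averaged over all 1,540 query-answer pairs.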

Use as a Library

from evograph_nolinenode import EvoGraphNoLineNode

evo = EvoGraphNoLineNode()

result = evo.answer("When did Jon lose his job as a banker?")
print(result["answer"])
# >>> "February, 2023"

evo.close()

Project Structure

├── evograph_nolinenode.py       # Main interface — retrieve + answer
├── graph_retriever.py           # Two-layer retrieval + embedding rerank
├── knowledge_graph_nolinenode.py # Neo4j graph operations (Entity ↔ Note)
├── entity_extractor.py          # LLM-based entity extraction (CRUD)
├── embedding_manager.py         # Sentence-Transformers wrapper
├── kg_builder_nolinenode.py     # Graph construction pipeline
├── llm_client.py                # LLM API client with load balancing
├── config.py                    # Configuration (LLM, Neo4j, thresholds)
├── import_data.py               # Data import entry point
├── evaluate.py                  # Evaluation harness (F1, BLEU, Recall)
└── eval_results/                # Benchmark results

Why It Works: Theoretical Framework

1. Information Bottleneck

Structured representations are lossy compression. Converting "I moved to New Haven last week because my new job is there" into {entity: "James", location: "New Haven", action: "moved"} loses the causal reasoning chain. LLMs need complete context.

2. Pre-training Distribution Alignment

LLMs are trained on natural text. Feeding them natural dialogue activates their strongest reasoning capabilities. Structured formats (JSON, timelines, graph tuples) are out-of-distribution.

3. Cognitive Load

Explicit structure (e.g., Timeline: James → Hartford (LIVES_IN) → New Haven (LIVES_IN, UPDATE)) forces the LLM to first parse the format, then reason. Raw text eliminates this overhead.


Benchmark

We evaluate on LoCoMo (Long-term Conversational Memory), which tests whether systems can answer questions requiring information scattered across hundreds of conversation turns. LoCoMo-10 contains 10 conversations with 1,540 question-answer pairs across four categories:

  • Multi-hop (54.9%): Requires reasoning across multiple conversation turns
  • Temporal (20.9%): Requires temporal reasoning about events
  • Single-hop (18.4%): Simple fact lookup from a single turn
  • Open-domain (5.8%): General knowledge grounded in conversation

Complexity Comparison

| System | Graph Structure | Retrieval | Lines of Code | F1 (%) |
|---|---|---|---|---|
| EvoGraph | Single-layer (Entity-Note) | Entity matching | ~1,500 | 44.82 |
| SYNAPSE | Three-layer (perception/semantic/event) | Spreading activation + lateral inhibition | ~5,000 | 40.50 |
| Zep | Temporal knowledge graph | Temporal traversal | ~3,000 | 39.70 |

Citation

@misc{evograph2026,
  title={EvoGraph: Less is More for LLM Long-Term Conversational Memory},
  year={2026},
  note={Under review}
}

License

MIT License
