
fix: FAISS index/docstore race condition (KeyError crash) #1542

Open
gdeyoung wants to merge 1 commit into agent0ai:main from gdeyoung:fix/faiss-race-condition-017


@gdeyoung
Contributor

Problem

Memory searches crash with KeyError: 56397 in similarity_search_with_score_by_vector when it accesses index_to_docstore_id[i]. The FAISS index accumulates more vectors than the mapping dict has entries, causing all memory searches to fail.

Root Cause

Race condition in the shared MyFaiss object (class variable Memory.index). When multiple concurrent agents write to FAISS simultaneously (e.g., interactive agent + scheduler tasks):

  1. Agent A calls aadd_documents() → adds vector to FAISS index (ntotal increments)
  2. Agent B calls aadd_documents() → adds another vector
  3. Agent A updates index_to_docstore_id mapping and calls save_local()
  4. save_local() writes index.faiss with BOTH vectors but index.pkl with only Agent A's mapping entry
  5. Any search hitting the unmapped vector → KeyError
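The interleaving above can be replayed deterministically without FAISS. The sketch below is an illustration only: `ToyIndex`, `saved_ntotal`, `saved_mapping`, and `lookup` are hypothetical stand-ins (not names from this codebase) for the FAISS index, the contents of index.faiss and index.pkl, and the lookup done during a search.

```python
class ToyIndex:
    """Hypothetical stand-in for the FAISS index: it only counts vectors."""

    def __init__(self):
        self.ntotal = 0

    def add(self):
        """Append one vector and return the position it was stored at."""
        position = self.ntotal
        self.ntotal += 1
        return position


index = ToyIndex()
index_to_docstore_id = {}

# Step 1: Agent A adds its vector (ntotal becomes 1).
pos_a = index.add()

# Step 2: Agent B adds its vector before A has finished (ntotal becomes 2).
pos_b = index.add()

# Step 3: Agent A records its own mapping entry and saves.
index_to_docstore_id[pos_a] = "doc-a"

# Step 4: the save captures both vectors but only Agent A's mapping entry.
saved_ntotal = index.ntotal
saved_mapping = dict(index_to_docstore_id)

# Positions present in the saved index but missing from the saved mapping.
unmapped = [i for i in range(saved_ntotal) if i not in saved_mapping]


def lookup(position):
    """Step 5: the lookup a search performs; raises KeyError when unmapped."""
    return saved_mapping[position]
```

In this replay `lookup(pos_b)` raises KeyError, which mirrors the crash reported in the Problem section.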

Evidence

Observed the index grow from 56,398 to 56,427 vectors (29 new) during a 10-minute debugging session, with one orphaned mapping per concurrent write. The same desync recurred on consecutive days.

Fix (3 layers, defense in depth)

| Layer | Fix | Purpose |
| --- | --- | --- |
| Prevention | threading.Lock on insert_documents/update_documents | Serializes concurrent FAISS mutations |
| Recovery | Auto-repair on KeyError in both search methods | Links orphan docstore entries to unmapped index positions |
| Integrity | Atomic saves via temp dir + os.replace() | Prevents split-brain on crash |
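As a rough illustration of the Prevention and Integrity layers on the write path, here is a minimal sketch. It is not the code in this PR: `get_write_lock`, `save_atomically`, `ToyStore`, and the toy `insert_documents` are hypothetical, the lock key is assumed to be the database directory, and pickle stands in for the real FAISS serialization.

```python
import os
import pickle
import tempfile
import threading

# Hypothetical registry, shaped like Memory._write_lock: dict[str, threading.Lock].
_write_lock: dict[str, threading.Lock] = {}
_registry_guard = threading.Lock()


def get_write_lock(key: str) -> threading.Lock:
    """Return the lock for `key`, creating it exactly once."""
    with _registry_guard:
        return _write_lock.setdefault(key, threading.Lock())


def save_atomically(db_dir: str, files: dict[str, bytes]) -> None:
    """Write all files to a temp dir first, then move each into place."""
    os.makedirs(db_dir, exist_ok=True)
    # The temp dir lives inside db_dir so os.replace() stays on one filesystem.
    with tempfile.TemporaryDirectory(dir=db_dir) as tmp:
        for name, payload in files.items():
            with open(os.path.join(tmp, name), "wb") as fh:
                fh.write(payload)
                fh.flush()
                os.fsync(fh.fileno())
        # Nothing in db_dir is touched until every file is fully written.
        for name in files:
            os.replace(os.path.join(tmp, name), os.path.join(db_dir, name))


class ToyStore:
    """Hypothetical stand-in for the shared index plus its mapping dict."""

    def __init__(self):
        self.vectors = []
        self.index_to_docstore_id = {}


def insert_documents(store: ToyStore, db_dir: str, doc_ids: list[str]) -> None:
    """Mutate the index and mapping, then save, all under one lock."""
    with get_write_lock(db_dir):
        for doc_id in doc_ids:
            store.vectors.append(doc_id)
            store.index_to_docstore_id[len(store.vectors) - 1] = doc_id
        save_atomically(db_dir, {
            "index.faiss": pickle.dumps(store.vectors),  # toy payload, not real FAISS bytes
            "index.pkl": pickle.dumps(store.index_to_docstore_id),
        })
```

Holding the lock across both the mutation and the save means no writer can add a vector between another writer's index update and its snapshot. Note that os.replace() is atomic per file, not across the pair, so a crash between the two replaces could still leave the files out of step; that gap is what the Recovery layer is for.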

Changes

  • New imports: threading, tempfile
  • New class variable: Memory._write_lock: dict[str, threading.Lock]
  • New method: _repair_faiss_desync() — finds unmapped index positions and orphan docstore entries, links them
  • Modified: insert_documents, update_documents — wrapped with lock
  • Modified: search_similarity_threshold, search_similarity_threshold_with_scores — try/except KeyError with auto-repair
  • Modified: _save_db_file — atomic write via temp dir
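The repair and retry behaviour described in the list above might look roughly like the sketch below. This is an assumption-laden illustration, not the PR's `_repair_faiss_desync()`: `repair_desync` and `search_with_repair` are hypothetical names, plain dicts stand in for the index mapping and docstore, and pairing unmapped positions with orphans in insertion order is an assumed strategy that the description does not spell out.

```python
def repair_desync(ntotal, index_to_docstore_id, docstore_ids):
    """Link unmapped index positions to orphan docstore IDs, in order.

    Returns the number of links created. The in-order pairing is an
    assumption; it is only correct if orphans were stored in the same
    order as their vectors.
    """
    unmapped = [i for i in range(ntotal) if i not in index_to_docstore_id]
    mapped_ids = set(index_to_docstore_id.values())
    orphans = [doc_id for doc_id in docstore_ids if doc_id not in mapped_ids]
    for position, doc_id in zip(unmapped, orphans):
        index_to_docstore_id[position] = doc_id
    return min(len(unmapped), len(orphans))


def search_with_repair(positions, ntotal, index_to_docstore_id, docstore):
    """Resolve index positions to documents, repairing once on KeyError."""
    try:
        return [docstore[index_to_docstore_id[i]] for i in positions]
    except KeyError:
        repair_desync(ntotal, index_to_docstore_id, list(docstore))
        # A second KeyError propagates: the desync could not be repaired.
        return [docstore[index_to_docstore_id[i]] for i in positions]


# Example: position 1 has a vector but no mapping entry; "b" is the orphan.
example_mapping = {0: "a", 2: "c"}
example_docstore = {"a": "doc A", "b": "doc B", "c": "doc C"}
example_hits = search_with_repair([1], 3, example_mapping, example_docstore)
```

After the call, `example_mapping` contains an entry for position 1 and the search returns the orphaned document instead of raising.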

Testing

  • Syntax validated
  • Repair tested on live index (56K+ vectors)
  • Search verified working after repair
  • Desync confirmed recurring before fix, stopped after lock applied
