fix: dedup punctuation + ingest model server startup (#305, #337) by buildingjoshbetter · Pull Request #361 · buildingjoshbetter/TrueMemory

buildingjoshbetter · 2026-05-21T21:51:15Z

Summary

#305 (P3), "Dedup Jaccard tokenization: punctuation breaks word matching": _word_overlap() tokenizes with str.split(), which keeps punctuation attached to words, so "Austin," != "Austin". Fixed by tokenizing with re.findall(r'\w+', ...).
#337 (P3), "Ingest process loads 1.5GB model for prediction error when it could be skipped or remote": the ingest subprocess (_run_ingest() in cli.py) never calls ensure_server_running(), so the encoding gate's prediction error (PE) computation falls back to loading the full Qwen3 model (~1.5GB) locally. Fixed by calling ensure_server_running() at the top of _run_ingest().
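
To illustrate the #305 tokenization change, here is a minimal sketch of the two tokenizers side by side. It is not the actual _word_overlap() from dedup.py: the function names, the lowercasing step, and the empty-input handling are assumptions made for illustration only.

```python
import re


def jaccard(tokens_a: set, tokens_b: set) -> float:
    """Jaccard similarity: |A & B| / |A | B|, defined as 0.0 if either set is empty."""
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def word_overlap_split(a: str, b: str) -> float:
    """Pre-fix style: whitespace split keeps punctuation attached to words."""
    return jaccard(set(a.lower().split()), set(b.lower().split()))


def word_overlap_regex(a: str, b: str) -> float:
    """Post-fix style: \\w+ keeps only runs of word characters."""
    return jaccard(
        set(re.findall(r"\w+", a.lower())),
        set(re.findall(r"\w+", b.lower())),
    )


if __name__ == "__main__":
    print(word_overlap_split("Austin,", "Austin"))
    print(word_overlap_regex("Austin,", "Austin"))
```

With these definitions, "Austin," and "Austin" share no tokens under whitespace splitting but are identical under the regex tokenizer, which matches the 0% to 100% change described in this PR.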

Files changed

truememory/ingest/dedup.py — re.findall tokenization, module-level import re, removed redundant local import
truememory/ingest/cli.py — ensure_server_running() call in _run_ingest()

Files NOT changed (intentionally)

truememory/ingest/encoding_gate.py — PE computation fully preserved, gate weights unchanged
truememory/ingest/hooks/stop.py — Popen untouched, spawn_gate untouched

What this does NOT do

Does NOT skip prediction error (PE is essential, w_prediction_error=0.30)
Does NOT add TRUEMEMORY_GATE_NO_PE env var
Does NOT modify encoding gate weights or threshold
Does NOT touch stop.py or spawn_gate

Test evidence

Custom validation tests: 16/16 passed (punctuation stripping, Jaccard correctness, PE preservation, file integrity)
Full test suite: 109 passed, 1 failed (pre-existing test_stop_hook_safety unrelated to these changes), 1 skipped
Sacred parameters: verified unchanged (gate threshold 0.30, weights 0.25/0.20/0.30, dedup threshold 0.92)
encoding_gate.py: verified zero diff
stop.py: verified zero diff
Ruff lint: clean
Pre-fix consensus: 4/5 APPROVE (Grok API timeout)
Post-fix consensus: 4/5 APPROVE (Grok API timeout)

Risk

Bug 1: \w+ drops ALL non-word characters, so hyphenated words are split into separate tokens ("well-known" becomes "well" and "known"). Jaccard is a coarse heuristic, so this is acceptable.
Bug 2: ensure_server_running() may add ~1-2 seconds if it needs to start the model server. This is a one-time cost that saves 10-30 seconds of local model loading. Wrapped in try/except so failure is non-fatal.
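
The non-fatal startup described under Bug 2 could look roughly like the sketch below. The real _run_ingest() and ensure_server_running() are not shown in this PR description, so everything here is a hypothetical stand-in: the stub always fails in order to exercise the fallback path, and the returned mode string exists only to make the behavior observable.

```python
import logging

log = logging.getLogger("ingest_sketch")


def ensure_server_running() -> None:
    """Hypothetical stub for the real helper; always fails in this sketch."""
    raise ConnectionError("model server unavailable")


def run_ingest(start_server=ensure_server_running) -> str:
    """Try to bring up the shared model server first; never abort ingest if that fails."""
    try:
        start_server()
        mode = "server"
    except Exception as exc:  # broad on purpose: startup failure must be non-fatal
        log.warning("model server startup failed, falling back to local load: %s", exc)
        mode = "local"
    # ... the actual ingest work would run here, using the server when available ...
    return mode


if __name__ == "__main__":
    print(run_ingest())              # stub fails, so ingest falls back
    print(run_ingest(lambda: None))  # startup succeeds
```

The point of the pattern is ordering: the startup attempt happens before the encoding gate needs the model, and any failure only costs the old fallback behavior rather than the ingest run itself.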

Closes #305, closes #337

1. dedup.py: replace str.split() with re.findall(r'\w+', ...) in _word_overlap() so punctuation is stripped before Jaccard comparison. "Austin," vs "Austin" was 0% match, now correctly 100%.
2. cli.py: add ensure_server_running() call at the top of _run_ingest() so ingest subprocesses use the shared model server instead of falling back to loading Qwen3-Embedding-0.6B (~1.5GB) locally.

The encoding gate and prediction error are UNCHANGED; this just ensures the model server is alive before the gate needs it.

Sacred parameters verified unchanged. encoding_gate.py and stop.py verified untouched.

5-model consensus: 4/5 APPROVE (1 API timeout)

Closes #305, closes #337

buildingjoshbetter merged commit 202aa86 into main May 21, 2026
14 checks passed

buildingjoshbetter merged 1 commit into main from fix/s03-ingest-pipeline
