fix: dedup punctuation + ingest model server startup (#305, #337) #361

Merged
buildingjoshbetter merged 1 commit into main from fix/s03-ingest-pipeline
May 21, 2026

@buildingjoshbetter
Owner

Summary

Files changed

  • truememory/ingest/dedup.py: re.findall tokenization, module-level import re, removed redundant local import
  • truememory/ingest/cli.py: ensure_server_running() call in _run_ingest()

Files NOT changed (intentionally)

  • truememory/ingest/encoding_gate.py — PE computation fully preserved, gate weights unchanged
  • truememory/ingest/hooks/stop.py — Popen untouched, spawn_gate untouched

What this does NOT do

  • Does NOT skip prediction error (PE is essential, w_prediction_error=0.30)
  • Does NOT add TRUEMEMORY_GATE_NO_PE env var
  • Does NOT modify encoding gate weights or threshold
  • Does NOT touch stop.py or spawn_gate

Test evidence

  • Custom validation tests: 16/16 passed (punctuation stripping, Jaccard correctness, PE preservation, file integrity)
  • Full test suite: 109 passed, 1 failed (pre-existing test_stop_hook_safety unrelated to these changes), 1 skipped
  • Sacred parameters: verified unchanged (gate threshold 0.30, weights 0.25/0.20/0.30, dedup threshold 0.92)
  • encoding_gate.py: verified zero diff
  • stop.py: verified zero diff
  • Ruff lint: clean
  • Pre-fix consensus: 4/5 APPROVE (Grok API timeout)
  • Post-fix consensus: 4/5 APPROVE (Grok API timeout)

Risk

  • Bug 1: \w+ strips ALL non-word characters, so a hyphenated word is split into separate tokens (for example, "well-known" becomes "well" and "known"). Jaccard is a coarse heuristic, so this is acceptable.
  • Bug 2: ensure_server_running() may add ~1-2 seconds if it needs to start the model server. This is a one-time cost that saves 10-30 seconds of local model loading. Wrapped in try/except so failure is non-fatal.
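
The non-fatal behavior described for Bug 2 can be sketched roughly as follows. This is an illustration only, not the code in cli.py: `start_server_best_effort` is a hypothetical helper name, and because the real signature of ensure_server_running() is not shown in this PR, the startup hook is passed in as a plain callable.

```python
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def start_server_best_effort(ensure_server_running: Callable[[], object]) -> bool:
    """Try to start the shared model server without aborting ingest.

    Hypothetical helper: returns True if the hook completed, False if it
    raised. The caller continues either way.
    """
    try:
        ensure_server_running()
        return True
    except Exception as exc:  # broad on purpose: startup must never kill ingest
        logger.warning(
            "model server startup failed; falling back to local model load: %s", exc
        )
        return False
```

With this shape, a failed startup costs only the slower local model load rather than a failed ingest run.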

Closes #305, closes #337

1. dedup.py: replace str.split() with re.findall(r'\w+', ...) in
   _word_overlap() so punctuation is stripped before Jaccard comparison.
   "Austin," vs "Austin" was 0% match, now correctly 100%.

2. cli.py: add ensure_server_running() call at the top of _run_ingest()
   so ingest subprocesses use the shared model server instead of falling
   back to loading Qwen3-Embedding-0.6B (~1.5GB) locally. The encoding
   gate and prediction error are UNCHANGED — this just ensures the model
   server is alive before the gate needs it.
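
The tokenization change in item 1 can be illustrated with a minimal sketch. It is not the body of _word_overlap(): the function names `jaccard`, `split_tokens`, and `word_tokens` are hypothetical, and any case folding or other normalization the real function may apply is left out.

```python
import re
from typing import Callable, List


def jaccard(a: str, b: str, tokenize: Callable[[str], List[str]]) -> float:
    """Jaccard similarity of the token sets of two strings."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    union = set_a | set_b
    if not union:
        return 0.0  # both inputs produced no tokens
    return len(set_a & set_b) / len(union)


def split_tokens(text: str) -> List[str]:
    """Whitespace tokenization: punctuation stays attached to words."""
    return text.split()


def word_tokens(text: str) -> List[str]:
    """Regex tokenization: keeps only runs of word characters."""
    return re.findall(r"\w+", text)


before = jaccard("Austin,", "Austin", split_tokens)  # "Austin," != "Austin"
after = jaccard("Austin,", "Austin", word_tokens)    # both reduce to "Austin"
hyphenated = word_tokens("well-known fix")           # hyphen acts as a separator
```

Here `before` is 0.0 and `after` is 1.0, matching the 0% and 100% figures above, while `hyphenated` shows the accepted side effect: "well-known" is split into "well" and "known".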

Sacred parameters verified unchanged.
encoding_gate.py and stop.py verified untouched.
5-model consensus: 4/5 APPROVE (1 API timeout)

Closes #305, closes #337
buildingjoshbetter merged commit 202aa86 into main on May 21, 2026
14 checks passed

Development

Successfully merging this pull request may close these issues.

  • Ingest process loads 1.5GB model for prediction error when it could be skipped or remote
  • Dedup Jaccard tokenization: punctuation breaks word matching
